DESIGN OF POLYKETIDE SYNTHASE GENES 

FIELD OF INVENTION 

The present invention provides methods for the analysis of polyketides and the design of 
polyketide synthase genes. The invention relates to the fields of computational analysis, 
chemistry, molecular biology, and medicine. 

BACKGROUND OF THE INVENTION 

The class of compounds known as polyketides is a large family of diverse compounds 
synthesized primarily from 2-carbon unit building block compounds through a series of 
condensations and subsequent modifications. Polyketides occur in many types of organisms, 
including fungi and mycelial bacteria such as the actinomycetes. There are a wide variety of 
polyketide structures, and the class of polyketides encompasses numerous compounds with 
diverse activities. Epothilone, erythromycin, FK-506, FK-520, megalomicin, narbomycin, 
oleandomycin, picromycin, rapamycin, spinosyn, and tylosin are examples of such compounds. 

Given the difficulty in producing polyketide compounds by traditional chemical 
methodology, and the typically low production of polyketides in wild type cells, there has been 
considerable interest in finding improved or alternate means to produce polyketide compounds. 
See PCT Publication Nos. WO 95/08548; WO 96/40968; WO 97/02358; and WO 98/27203; United 
States Patent Nos. 5,962,290; 5,672,491; and 5,712,146; Fu et al., Biochemistry 33: 9321-9326 
(1994); McDaniel et al., Science 262: 1546-1555 (1993); and Rohr, Angew. Chem. Int. Ed. Engl. 
34(8): 881-888 (1995), each of which is incorporated herein by reference. 

Polyketides are synthesized in nature by polyketide synthase (PKS) enzymes. These 
enzymes, which are complexes of multiple large proteins, are similar to the synthases that 
catalyze condensation of 2-carbon unit building block compounds in the biosynthesis of fatty 
acids. The genes that encode PKS enzymes usually consist of three or more open reading 
frames (ORFs). Two major types of PKS enzymes are known that differ in their composition 
and mode of synthesis. These two major types of PKS enzymes are commonly referred to as 
Type I or "modular" and Type II or "iterative" PKS enzymes. 

Modular PKSs produce many different polyketides, including a large number of 12-, 14-, 
and 16-membered macrolide antibiotics including erythromycin, megalomicin, methymycin, 
narbomycin, oleandomycin, picromycin, and tylosin. Each ORF of a modular PKS can 
comprise one, two, or more "modules" of ketosynthase activity, each module of which consists 
of at least two (if a loading module) and more typically three (for the simplest extender module) 
or more enzymatic activities or "domains." These large multifunctional enzymes (>300 kDa) 
catalyze the biosynthesis of polyketide macrolactones through multistep pathways 
involving decarboxylative condensations between acyl thioesters followed by cycles of varying 
beta-carbon processing activities (see O'Hagan, D., The polyketide metabolites, E. Horwood, New 
York, 1991, which is incorporated herein by reference). 

During the past half decade, the study of modular PKS function and specificity has been 
greatly facilitated by the plasmid-based Streptomyces coelicolor expression system developed 
with the 6-deoxyerythronolide B (6-dEB) synthase (DEBS) genes (see Kao et al., Science 265: 
509-512 (1994), McDaniel et al., Science 262: 1546-1557 (1993), and U.S. Patent Nos. 
5,672,491 and 5,712,146, each of which is incorporated herein by reference). The advantages to 
this plasmid-based genetic system for DEBS are that it overcomes the tedious and limited 
techniques for manipulating the natural DEBS host organism, Saccharopolyspora erythraea, 
allows more facile construction of recombinant PKSs, and reduces the complexity of PKS 
analysis by providing a "clean" host background. This system also expedited construction of a 
combinatorial modular polyketide library in Streptomyces (see PCT publication No. WO 
98/49315, incorporated herein by reference). 

The ability to control aspects of polyketide biosynthesis, such as monomer selection and 
degree of beta-carbon processing, by genetic manipulation of PKSs has stimulated great interest in 
the combinatorial engineering of novel antibiotics (see Hutchinson, Curr. Opin. Microbiol. 1: 
319-329 (1998); Carreras and Santi, Curr. Opin. Biotech. 9: 403-411 (1998); and U.S. Patent 
Nos. 5,962,290; 5,712,146; and 5,672,491, each of which is incorporated herein by reference). 
This interest has resulted in the cloning, analysis, and manipulation by recombinant DNA 
technology of genes that encode PKS enzymes. The resulting technology allows one to 
manipulate a known PKS gene cluster either to produce the polyketide synthesized by that PKS 
at higher levels than occur in nature or in hosts that otherwise do not produce the polyketide. 
The technology also allows one to produce molecules that are structurally related to, but distinct 
from, the polyketides produced from known PKS gene clusters. 

Polyketides are assembled by polyketide synthases through successive condensations of 
activated coenzyme-A thioester monomers derived from small organic acids such as acetate, 
propionate, and butyrate. Active sites required for condensation include an acyltransferase 
(AT), acyl carrier protein (ACP), and beta-ketoacylsynthase (KS). Each condensation cycle 
results in a beta-keto group that undergoes all, some, or none of a series of processing activities. 
Active sites that perform these reactions include a ketoreductase (KR), dehydratase (DH), and 
enoylreductase (ER). Thus, the absence of any beta-keto processing domain results in the 
presence of a ketone, a KR alone gives rise to a hydroxyl, a KR and DH result in an alkene, 
while a KR, DH, and ER combination leads to complete reduction to an alkane. After assembly 
of the polyketide chain, the molecule typically undergoes cyclization(s) and post-PKS 
modification (e.g. glycosylation, oxidation, acylation) to achieve the final active compound. 
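
The relationship between the reductive domains of a module and the functional group left at the beta-carbon can be written as a small lookup. The following Python sketch is illustrative only; the function name `beta_carbon_state` and the string labels are assumptions made for this example and are not part of the described method.

```python
def beta_carbon_state(domains):
    """Return the beta-carbon functional group left by one module.

    `domains` is any iterable of domain abbreviations such as
    "KS", "AT", "KR", "DH", "ER", "ACP". Only KR, DH and ER matter here.
    Assumption: a DH without a KR, or an ER without a DH, has no
    substrate to act on, so the cascade stops at the first missing domain.
    """
    present = set(domains)
    if "KR" not in present:
        return "ketone"      # no beta-keto processing at all
    if "DH" not in present:
        return "hydroxyl"    # KR alone
    if "ER" not in present:
        return "alkene"      # KR + DH
    return "alkane"          # KR + DH + ER: full reduction


if __name__ == "__main__":
    for module in (["KS", "AT", "ACP"],
                   ["KS", "AT", "KR", "ACP"],
                   ["KS", "AT", "DH", "KR", "ACP"],
                   ["KS", "AT", "DH", "ER", "KR", "ACP"]):
        print(module, "->", beta_carbon_state(module))
```

The early-return order mirrors the order in which the processing steps occur, which is why a missing KR masks any DH or ER listed after it.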

To illustrate the synthesis of a macrolide by a modular PKS (see Cane et al., Science 
282: 63 (1998), incorporated herein by reference), one can refer to the PKS that produces the 
erythromycin polyketide (6-deoxyerythronolide B synthase or DEBS; see U.S. Patent No. 
5,824,513, incorporated herein by reference). In the modular DEBS PKS enzyme, the enzymatic 
steps for each round of condensation and reduction are encoded within a single "module" of the 
polypeptide (i.e., one distinct module for every condensation cycle). As shown in Figure 1, 
DEBS consists of a loading module and 6 extender modules and a chain terminating thioesterase 
(TE) domain within three extremely large polypeptides encoded by three open reading frames 
(ORFs, designated eryAI, eryAII, and eryAIII). 

Each of the three polypeptide subunits of DEBS (DEBS1, DEBS2, and DEBS3 in 
Figure 1) contains 2 extender modules. DEBS1 additionally contains the loading module, and 
DEBS3 contains the TE domain. Collectively, these proteins catalyze the condensation and 
appropriate reduction of one propionyl CoA starter unit and six methylmalonyl CoA extender 
units. Modules 1, 2, 5, and 6 contain KR domains; module 4 contains a complete set, 
KR/DH/ER, of reductive and dehydratase domains; and module 3 contains no functional 
reductive domain. Following the condensation and appropriate dehydration and reduction 
reactions, the enzyme bound intermediate is lactonized by the TE at the end of extender module 
6 to form 6-dEB (compound 1 in Figure 1). 
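
The organization of DEBS described above can be captured in a simple data structure. The Python sketch below is a minimal illustration; the names `DEBS`, `REDUCTIVE_DOMAINS`, `extender_modules`, `total_carbons`, and `processing_summary` are assumptions made for this example. The carbon bookkeeping assumes that each methylmalonyl extender loses one carbon as CO2 during the decarboxylative condensation described earlier.

```python
# Protein -> ordered contents, following the description of Figure 1.
DEBS = {
    "DEBS1": ["load", 1, 2],
    "DEBS2": [3, 4],
    "DEBS3": [5, 6, "TE"],
}

# Reductive domains present in each extender module, as stated in the text.
REDUCTIVE_DOMAINS = {
    1: {"KR"}, 2: {"KR"}, 3: set(),
    4: {"KR", "DH", "ER"}, 5: {"KR"}, 6: {"KR"},
}

STARTER_CARBONS = 3        # propionyl starter unit
EXTENDER_CARBONS = 4 - 1   # methylmalonyl minus the CO2 lost on condensation


def extender_modules(pks):
    """List extender module numbers in assembly-line order."""
    return [m for contents in pks.values() for m in contents
            if isinstance(m, int)]


def total_carbons(pks):
    """Carbon atoms contributed by the starter and all extender units."""
    return STARTER_CARBONS + EXTENDER_CARBONS * len(extender_modules(pks))


def processing_summary():
    """Module number -> sorted reductive domains, or 'none'."""
    return {m: "+".join(sorted(d)) or "none"
            for m, d in REDUCTIVE_DOMAINS.items()}


if __name__ == "__main__":
    print(extender_modules(DEBS))
    print(total_carbons(DEBS))
    print(processing_summary())
```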



More particularly, the loading module of DEBS consists of two domains, an acyltransferase 
(AT) domain and an acyl carrier protein (ACP) domain. In other PKS enzymes, the 
loading module is not composed of an AT and an ACP but instead utilizes a partially inactivated 
KS, an AT, and an ACP. This partially inactivated KS is in most instances called KS^Q, where 
the superscript letter is the abbreviation for the amino acid, glutamine, that is present instead of a 
cysteine in the active site that is believed to be required for condensation activity. Although the 
KS^Q domain lacks condensation activity, it retains decarboxylase activity. The AT domain of 
the loading module recognizes a particular acyl-CoA (propionyl for DEBS, which can also 
accept acetyl) and transfers it as a thiol ester to the ACP of the loading module. Concurrently, 
the AT on each of the extender modules recognizes a particular extender-CoA (methylmalonyl 
for DEBS) and transfers it to the ACP of that module to form a thioester. Once the PKS is 
primed with acyl- and malonyl-ACPs, the acyl group of the loading module migrates to form a 
thiol ester (trans-esterification) at the KS of the first extender module; at this stage, extender 
module 1 possesses an acyl-KS and a methylmalonyl ACP. The acyl group derived from the 
loading module is then covalently attached to the alpha-carbon of the malonyl group to form a 
carbon-carbon bond, driven by concomitant decarboxylation, and generating a new acyl-ACP 
that has a backbone two carbons longer than the loading unit (elongation or extension). The 
growing polyketide chain (various intermediates are shown in Figure 1) is transferred from the 
ACP to the KS of the next module, and the process continues. 

The polyketide chain, growing by two carbons each module, is sequentially passed as a 
covalently bound thiol ester from module to module, in an assembly line-like process. The 
carbon chain produced by this process alone would possess a ketone at every other carbon atom, 
producing a polyketone, from which the name polyketide arises. Commonly, however, 
additional enzymatic activities modify the beta keto group of the polyketide chain to which the 
two carbon unit has been added before the chain is transferred to the next module. Modules may 
contain additional enzymatic activities as well, such as methyl transferase domains, but there are 
no such additional activities in DEBS. 

Once a polyketide chain traverses the final extender module of a modular PKS, it 
encounters the releasing domain or thioesterase found at the carboxyl end of most PKSs. Here, 
the polyketide is cleaved from the enzyme and cyclized. The resulting polyketide can be 
modified further by tailoring or modification enzymes; these enzymes add carbohydrate groups 
or methyl groups, or make other modifications, e.g., oxidation or reduction, on the polyketide 
core molecule. For example, the final steps in conversion of 6-dEB to erythromycin A include 
the actions of a number of modification enzymes, such as C-6 hydroxylation, attachment of 
mycarose and desosamine sugars, C-12 hydroxylation (which produces erythromycin C), and 



WO 01/92991 PCT/US01/17352 

conversion of mycarose to cladinose via O-methylation. These modifications in various 
combinations result in erythromycins A (compound 2 in Figure 1), B, C, and D. 

While the detailed understanding of the mechanisms by which PKS enzymes function 
and the development of methods for manipulating PKS genes have facilitated the creation of 
novel polyketides, there remain substantial impediments to the creation of novel polyketides by 
genetic engineering. One such impediment is the availability of PKS genes. Many polyketides 
are known but only a relatively small portion of the corresponding PKS genes have been cloned 
and are available for manipulation. Moreover, in many instances the producing organism for an 
interesting polyketide is obtainable only with great difficulty and expense, and techniques for its 
growth in the laboratory and production of the polyketide it produces are unknown or difficult or 
time-consuming to practice. Also, even if the PKS genes for a desired polyketide have been 
cloned, those genes may not serve to drive the level of production desired in a particular host 
cell. 

If there were a method to produce a desired polyketide without having to access the 
genes that encode the PKS that produces the polyketide, then many of these difficulties could be 
ameliorated or avoided altogether. The present invention meets this need. 



SUMMARY OF THE INVENTION 

In one embodiment, the present invention provides methods for the computational 
analysis of polyketides and the computer-assisted design of PKS genes. 

In a first aspect, the present invention provides a method for representing the structure of 
a polyketide and/or a PKS gene that encodes the PKS that produces the polyketide by 
alphanumeric symbols that facilitates computer assisted analysis. 

In a second aspect, the present invention provides a database of polyketides and 
corresponding PKS genes that can be rapidly searched and information extracted for a variety of 
applications. More particularly, this database can include, in one mode, all known polyketides; 
and in another mode, the polyketides, optionally including all intermediates, produced by all 
known PKS genes or a subset thereof. 

In a third aspect, the present invention provides a method for predicting the structure of a 
PKS and its corresponding genes from the structure of a polyketide. 

In a fourth aspect, the present invention provides a method for designing novel PKS 
genes capable of producing a desired polyketide. This aspect of the invention is directed to the 
design and specification of PKS genes via the recombining of modules or portions of modules or 
sets of modules from already known and available PKS genes. In one mode, all possible PKS 
genes encoding a desired polyketide from a set of genes in a database are generated. In another 
mode, only a subset of such possible PKS genes is generated based on one or more parameters 
selected by the user. More particularly, a rating system is provided to sort the PKS genes 
designed for a particular target polyketide based on any one or more of several criteria, including 
number of non-native module interfaces, number of non-native protein interfaces, and other 
parameters as more particularly described below or selected by the user. 

In another embodiment, the present invention provides methods and reagents for 
preparing novel PKS genes that encode PKS enzymes that produce a desired polyketide. 

In a first aspect, the present invention provides a library of recombinant DNA 
compounds, wherein each member of said library encodes a module of a PKS or portions of 
modules or sets of modules having a desired specificity, and the library as a whole encompasses 
all of the members of a desired class of specificities. 

In a second aspect, the present invention provides a method for assembling a PKS gene 
cluster that encodes a PKS that produces a desired polyketide from known and available PKS 
genes other than the naturally occurring PKS genes that produce the polyketide in nature. 

These and other embodiments, modes, and aspects of the invention are described in more 
detail in the following description, the examples, and claims set forth below. 

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 shows a schematic representation of the PKS enzyme that synthesizes 6- 
deoxyerythronolide B (6-dEB, compound 1). The PKS is composed of three proteins, DEBS1, 
DEBS2, and DEBS3, each of which is represented by an arrow and contains two or more 
modules. Each module is represented by a solid line, and the domains in each module are shown 
inside the arrow. Various intermediates produced during the synthesis are also shown, as are the 
structures of erythromycins A (compound 2), B, and D resulting from modification of 6-dEB. 

Figure 2 shows an illustrative set of 2-carbon unit monomers present in macrocyclic 
polyketides; these monomers can be used to represent polyketide backbone diversity generated 
by commonly used starter and extender units (malonyl CoA and methylmalonyl CoA) and the 
condensation and reduction reactions mediated by PKS enzymes. 

Figure 3 shows a representation of 6-dEB by molecular graph, CHUCKLES notation, 
and SMILES notation. The CHUCKLES notation uses the 2-carbon unit monomers shown in 
Figure 2. In the CHUCKLES notation, the order of attachment of monomers is designated by 
the order in which monomers are listed, and the attachment points within the monomers are 
specified in their definitions. In the SMILES notation, adjacent monomers are attached via 
single (covalent) bonds depicted by dashes. The cyclization bond is represented by the index 1 
adjacent to the Start and Close monomers. 

Figure 4 is a flowchart and block flow diagram in five parts designated A-E, inclusive. 

Flowchart Figure 4A is a block flow diagram of a computer system to design a novel 
PKS (and corresponding genes). 

Flowchart Figure 4B is a block flow diagram wherein the "Computer Program" block (2) 
of Flowchart Figure 4A is further defined. 

Flowchart Figure 4C is a block flow diagram wherein the "Design novel hybrid PKS 
genes from library for TARGET" block of Flowchart Figure 4B is further defined. 

Flowchart Figure 4D is a block flow diagram wherein the "align TARGET with 
STARTER; copy to ALIGNMENT" block of Flowchart Figure 4C is further defined. 

Flowchart Figure 4E is a block flow diagram wherein the "Rate novel hybrid designs" 
block (3) of Flowchart Figure 4B is further defined. 

Figure 5 shows a flowchart of a matching method for the generation of the CHUCKLES 
strings used for all polyketides in a library. 

DETAILED DESCRIPTION OF THE INVENTION 

Because polyketides synthesized by modular PKS genes are built by the enzymatically 
controlled addition of primarily 2-carbon unit monomers and, to a lesser extent, other more 
complex monomers, each polyketide may be represented as a string of 2-carbon unit and other 
monomers. These monomers represent the portion of the polyketide backbone structure as a 
result of the incorporation of various starter and extender units (malonate, methyl malonate, etc.) 
and the subsequent chemical reactions. 

These reactions include: 

(1) condensation reactions, of which there are three basic reactions: malonyl-CoA 
condensation and methylmalonyl-CoA condensation with the branched methyl having either R 
or S stereochemistry; and 

(2) reduction reactions, of which there are five basic reactions: no reduction (ketone 
preserved), keto-reduction (to yield a hydroxyl having either R or S stereochemistry), 
dehydration (trans double bond), and enoyl-reduction (to yield a methylene). 

An illustrative set of the basic monomers that can be used to represent a polyketide 
structure (and their corresponding symbols) comprises: 




[Structure drawings of the monomers, each labeled with its one-letter symbol (A through N; see Figure 2).] 

A miscellaneous monomer, Q, can be used to denote a portion of the polyketide structure that 
cannot be assigned by monomers A-N. 

The monomer set shown above and in Figure 2 does not represent the actual monomers 
incorporated during biosynthesis. Instead, these monomers include a carbon from two different 
biosynthetic monomers. This is best explained using a polyketide fragment depicted below. 




[Structure drawing of the polyketide fragment, bearing two methyl (CH3) branches.] 



The fragment includes two two-carbon units, i and i+1, and part of a third two-carbon unit, i-1, 
that were incorporated into the polyketide during biosynthesis. The i-th extender module 
attaches the two-carbon biosynthetic unit whose backbone carbons are designated as alpha(i) and 
beta(i), and the second extender module (module i+1) attaches the two-carbon biosynthetic unit 
whose backbone carbons are designated as alpha(i+1) and beta(i+1). Using the monomer set shown 
above, this fragment consists of monomer A (derived from the beta carbon added in module i-1 
and the alpha carbon added in module i) and another monomer A (derived from the beta carbon 
added in module i and the alpha carbon added in module i+1). 




[Structure drawing of the same fragment divided into two monomers, each labeled A.] 

The fifth carbon, designated beta(i+1), remains unassigned and will depend on the identity of the 
two-carbon biosynthetic unit that is incorporated in the polyketide by module i+2. 

The set of monomers shown in Figure 2 can be expanded to include other starter and 
extender units, of which there are many. Such starter and extender units include, for example 
but without limitation, hydroxymalonate (e.g., niddamycin), methoxymalonate (e.g. FK-520), 
ethylmalonate (e.g., FK-520), amino acids or amino acid derivatives that are incorporated into 
polyketides by the action of a non-ribosomal peptide synthase (e.g., thiazole in epothilone and 
pipecolate in rapamycin), or other units incorporated by, for example, an AMP ligase (e.g., the 
dihydroxycyclohexyl moiety in rapamycin, FK-506, and FK-520) or a soluble CoA ligase. An 
illustrative set of additional starter and extender units includes: 




[Structure drawings of the additional monomers J', K', and M', each bearing a substituent R.] 

where R can be anything other than hydrogen or methyl (e.g., allyl, butyl, ethyl, hexyl, hydroxyl, 
isobutyl, and methoxy). 

The set of monomers can also include post-PKS modifications, such as hydroxylation, 
methylation, epoxidation, glycosylation, or addition of intra-macrocyclic fused rings making the 
system polycyclic. Also, a variety of methods are known for the incorporation of unusual starter 
and/or extender units in polyketide synthases (see, e.g., PCT Publication Nos. WO 97/02358; 
WO 99/03986; WO 98/01546; and WO 98/01571, each of which is incorporated herein by 
reference), and the monomer set can include such units. 

By viewing polyketides as composed of sets of distinct monomers, one can in 
accordance with the present invention define a polyketide as a string of alpha-numeric symbols 
to facilitate computer analysis. In one method, a modified CHUCKLES methodology for 
representing polyketides is used. The CHUCKLES methodology (see Siani et al., 
"CHUCKLES: a method for representing and searching peptide and peptoid sequence," J. 
Chem. Inf. Comput. Sci. 34: 588-593 (1994), which is incorporated herein by reference) for representing 
peptides and related oligomers allows monomers to be strung together such that the molecular 
graph for the basic macrocycle can be generated from the string of monomers. 

For example, using the set of monomers comprising A-N described above, the 
erythromycin macrocycle or 6-dEB can be represented as ADGJDD. This string of 
alphanumeric symbols is also referred to as the CHUCKLES string. Figure 3 depicts the 
relationship between the CHUCKLES string, the SMILES string, and the actual molecular 
structure of 6-dEB. The CHUCKLES string for 6-dEB can be annotated to represent the 
structure of erythromycin A: 
A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl). 
Thus, ring closure (cyclization) and post-synthetic modifications (glycosylation and 
hydroxylation), and non-standard units where applicable (there are none in 6-dEB and 
erythromycin) are entered between parentheses after each monomer. Another example is an 
annotated CHUCKLES string for epothilone B: 
ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E. 
As above, cyclization, post-synthetic modifications (epoxide formation), and 
non-standard units (methyl at C-4) are entered between parentheses after each monomer. 
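
An annotated string of this form is straightforward to split into monomers and their parenthesized annotations. The Python sketch below is one possible reading of the notation, not a definitive grammar: it assumes that each monomer is a single capital letter, optionally followed by a prime (as in J'), and that annotations within one pair of parentheses are separated by commas. The function names `parse_chuckles` and `backbone` are hypothetical.

```python
import re

# One monomer symbol (capital letter, optional prime) followed by an
# optional parenthesized annotation list. This grammar is an assumption.
_TOKEN = re.compile(r"\s*([A-Z]'?)(?:\(([^()]*)\))?")


def parse_chuckles(annotated):
    """Split an annotated string into (monomer, [annotations]) pairs."""
    text = annotated.strip()
    units, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        symbol, notes = match.group(1), match.group(2)
        annotations = [n.strip() for n in notes.split(",")] if notes else []
        units.append((symbol, annotations))
        pos = match.end()
    return units


def backbone(annotated):
    """Return the unannotated monomer string."""
    return "".join(symbol for symbol, _ in parse_chuckles(annotated))


if __name__ == "__main__":
    ery_a = ("A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)"
             "D(1-glycosyl)D(1-glycosyl)")
    for unit in parse_chuckles(ery_a):
        print(unit)
    print(backbone(ery_a))
```

Stripping the annotations from the erythromycin A string recovers the six-monomer string given above for 6-dEB.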

In another aspect of the present invention, a database of polyketides is provided. In one 
aspect of the present invention, the polyketides are represented by a string of defined monomers. 
In one embodiment, the monomers are selected from a group consisting of: 




[Structure drawings of the monomers, each labeled with its one-letter symbol (A through N; see Figure 2).] 

In another embodiment, polyketides are represented by the monomers A-N as well as 
additional monomers selected from the group consisting of 




[Structure drawings of the additional monomers, each bearing a substituent R.] 

where R can be anything other than hydrogen or methyl. 

The string of monomers can be represented as a linearized structure or as a string of 
symbols. For example, erythromycin can be represented as its aglycone, 6-dEB, as a linearized 
structure [structure drawing of linearized 6-dEB] or as a string of symbols, ADGJDD. 
Optionally, the string of symbols can be annotated as 
"A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)" to more fully capture 
the erythromycin structure. This set of annotated strings is referred to as a "coded library" or a 
"coded" database of the present invention. 

In an illustrative embodiment, the polyketide database consists of the polyketides 
described in current literature (Journal of Antibiotics (1981-present), Journal of Natural 
Products) and various databases (Chemical Abstracts CAPlus, AntiBase). All unique 
macrocyclic polyketides are converted to the modified CHUCKLES format. Of the ~1000 novel 
polyketides obtained, only ~200 different strings of monomers and unique macrocycles are 
needed to represent the much larger collection of polyketides in the database, because many of 
the differences between the naturally-occurring polyketides are due to different glycosyl (sugar) 
groups attached at different positions on the macrocycle. 
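
The reduction from many annotated entries to a smaller set of unique monomer strings amounts to grouping entries by their unannotated backbone. The Python sketch below illustrates the idea on three strings taken from this description; the function names and the input format (a list of name and annotated-string pairs) are assumptions made for the example.

```python
import re
from collections import defaultdict


def strip_annotations(annotated):
    """Remove parenthesized annotations and whitespace from a string."""
    return re.sub(r"\([^()]*\)|\s+", "", annotated)


def group_by_backbone(entries):
    """Map each unannotated monomer string to the entry names sharing it.

    `entries` is an iterable of (name, annotated_string) pairs.
    """
    groups = defaultdict(list)
    for name, annotated in entries:
        groups[strip_annotations(annotated)].append(name)
    return dict(groups)


if __name__ == "__main__":
    entries = [
        ("6-dEB", "ADGJDD"),
        ("erythromycin A",
         "A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)"
         "D(1-glycosyl)D(1-glycosyl)"),
        ("epothilone B",
         "ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E"),
    ]
    for string, names in group_by_backbone(entries).items():
        print(string, names)
```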

Thus, a macrocyclic polyketide can be converted to a string of 2-carbon monomers by 
mapping the monomers onto the polyketide. This can be performed manually or with computer 
assistance. First, any sugar moieties are conceptually removed by hydrolysis and any lactones 
(bond between the ketone and oxygen) are hydrolyzed thus generating a linearized structure of 
the backbone of the polyketide. Generally, this leaves a carboxy carbon at one end of the linear 
molecule and a hydroxyl at the other. The polyketide is then "sequenced" manually or in silico 
from the end containing the carboxy carbon, the end corresponding to the last monomer added 
by the PKS before synthesis is complete. This end serves as a convenient handle from which to 
start the mapping process. Although closing of the lactone often occurs between the two ends of 
the polyketide, this is not always the case. However, the last ketone added by the PKS is almost 
always involved in macrolactone formation and so serves as a more convenient handle than the 
hydroxyl for commencing sequencing. 

The manual or in-silico sequencing is performed by matching the monomers, one at a 
time, while traversing the macrocyclic backbone. First the carboxy carbon is skipped, and an 
attempt is made to match each of the monomers in the monomer set selected (i.e., monomer set 
A-N in Figure 2) against the next two carbons in the macrocycle. The match takes into account 
carbon, oxygen, and no substitution at each backbone position, chirality at each backbone 
position, and bond order between the two backbone carbons. 

If the sequencing is performed in silico, the method is referred to as back-translation and 
involves converting a molecular graph into a string of monomers. First, the monomer library is 
converted to SMARTS format. SMARTS is a superset of the SMILES language that specifies a 
pattern in a molecular graph (Daylight Software Manual: Theory, Daylight Chemical 
Information Systems; Irvine, CA 1993, incorporated herein by reference). SMARTS permits 
one to specify a variable number or a limit on the number of covalent bonds to non-hydrogen 
atoms from a particular atom. In contrast, SMILES assumes that the unspecified valences are 
hydrogens. For example, the SMILES string for monomer A is [C@@H](O)[C@H](C). The 
oxygen may be bonded to any other single atom; if the atom is not specified, it is assumed to be a 
hydrogen. In the SMARTS string for monomer A, [C@@H]([O;D2])[C@H]([CH3]), one can 
specify the exact number of hydrogens on some atoms (e.g., "CH3"). In addition, the "[O;D2]" 
indicates the oxygen is bonded to two (from D2) non-hydrogen atoms, in this case the first 
carbon and some other unspecified atom. This allows matching and distinction of 
post-modification moieties attached to the oxygen as well as additional cyclizations (six member 
rings can occur within the macrocycle; e.g., rapamycin). Thus, the SMARTS notation allows 
pattern matching against the polyketide molecular graph. 

When a match occurs, the atoms that match are tagged as part of a superset and labeled 
with the monomer name. Any atoms that are connected to the monomer that are not part of the 
macrocycle are tagged for identification as special precursor units (e.g., ethylmalonate instead of 
methyl malonate or malonate), or post-synthetic modification moieties (e.g., sugars, CCHO, 
hydroxylation, methylation). If all the atoms and bonds of the monomer cannot be identified, 
the monomer is given a designation to indicate the lack of identification (e.g., Q for question 
mark). These Q monomers can be used to identify monomers that are the site of post-PKS 
modifications that mask the function of the PKS module that generated that portion of the 
polyketide or that are not in the monomer set and so prevent the correlation of a particular 
segment of the backbone with one of the monomers in the monomer set. 

After a particular 2-carbon unit is identified, the next two carbons are processed the same 
way. This is repeated until all the backbone carbons are identified and labeled as monomers. 
When all two-carbon units are identified, one has generated an ordered sequence, or string, of 
monomers, which is a modified CHUCKLES string of the invention. Moieties corresponding to 
post-PKS modifications are appended to the monomer in the string as an annotation in 
parentheses. This method of sequencing may be extended to include any type of monomer. 
Figure 5 shows a flow chart of this matching method for the generation of the CHUCKLES 
strings used for all polyketides in a library. 
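
The two-carbons-at-a-time matching loop can be sketched without a cheminformatics toolkit by describing each backbone carbon with a substituent label. This Python sketch is deliberately simplified and is not the SMARTS-based method itself: it matches on substituents only and leaves out chirality and bond order, which the described method also checks. The pattern table, the substituent labels, and the symbols X and Y are hypothetical placeholders; only the pairing of a hydroxyl-bearing carbon with a methyl-bearing carbon for monomer A follows the SMILES string given above.

```python
def sequence_backbone(carbons, patterns, unknown="Q"):
    """Convert a list of backbone carbons into a monomer string.

    carbons  : substituent labels, one per backbone carbon, in traversal
               order, with the carboxy carbon already skipped.
    patterns : dict mapping (first, second) substituent pairs to a symbol.
    Returns (monomer_string, leftover), where leftover holds a trailing
    carbon that could not be paired.
    """
    symbols = []
    paired_end = len(carbons) - len(carbons) % 2
    for i in range(0, paired_end, 2):
        pair = (carbons[i], carbons[i + 1])
        # Unmatched two-carbon units get the miscellaneous symbol.
        symbols.append(patterns.get(pair, unknown))
    return "".join(symbols), carbons[paired_end:]


if __name__ == "__main__":
    # Hypothetical pattern table; X and Y are placeholders, not A-N symbols.
    patterns = {("OH", "CH3"): "A", ("=O", "CH3"): "X", ("H", "H"): "Y"}
    carbons = ["OH", "CH3", "=O", "CH3", "NH2", "H", "H", "H", "OH"]
    print(sequence_backbone(carbons, patterns))
```

A unit that matches no pattern is reported as Q rather than raising an error, which corresponds to the fallback designation described above.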

The CHUCKLES string can be in the order corresponding to the direction of 
biosynthesis on the PKS or its reverse. Each CHUCKLES string has a one-to-many relationship 
with the PKS gene in the producing organism. Thus, while many different organisms can 
produce the same polyketide using the same or different PKS genes, each PKS gene generally 
produces only one PKS that produces only one polyketide (some AT domains can bind different 
CoAs, leading to the production of multiple polyketides from a single PKS). This allows one to 
design, from the polyketide structure, a set of PKS genes that would produce that polyketide. 

Thus, the present invention provides methods and computational analysis tools for 
designing PKS genes to produce a desired polyketide. As an illustrative example, the present 
invention provides a computer program termed MORPH (see the Examples below) that can read 
the coded library (see the Examples below). An illustrative coded library consists of ~200 
unique polyketide CHUCKLES strings. The user specifies the target polyketide, which is 
converted from molecular structure to a CHUCKLES string. 

The program then performs the following, starting with each library compound or string: 

(1) aligns library compound and target compound, emphasizing alignment of adjacent 
monomers common between the two; 

(2) fills in the gaps using all possible combinations from all library members; 

(3) counts the number of non-natural inter-modular boundaries; and 

(4) outputs all these alignments. 
The alignments are then sorted based on the number of non-natural inter-modular 
boundaries. 
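
The four steps above and the final sort can be illustrated with a much-simplified Python sketch. It is not the MORPH program: it aligns STARTER and TARGET position by position without gaps, fills each gap from a single donor (the first library member, in name order, that contains the monomer) instead of enumerating all combinations, and treats a boundary as native only when two adjacent monomers come from consecutive positions of the same library member. All function names and the toy library are hypothetical, and monomers are assumed to be single characters.

```python
def find_donor(monomer, library):
    """Return (member_name, index) of the first member containing monomer."""
    for name in sorted(library):
        index = library[name].find(monomer)
        if index != -1:
            return (name, index)
    return None  # no library member supplies this monomer


def design_from_starter(starter_name, target, library):
    """Assign a source (member, index) to every position of the target."""
    starter = library[starter_name]
    sources = []
    for k, monomer in enumerate(target):
        if k < len(starter) and starter[k] == monomer:
            sources.append((starter_name, k))               # aligned position
        else:
            sources.append(find_donor(monomer, library))    # gap filled
    return sources


def count_non_native(sources):
    """Count adjacent pairs that are not consecutive in one member."""
    count = 0
    for left, right in zip(sources, sources[1:]):
        native = (left is not None and right is not None
                  and left[0] == right[0] and right[1] == left[1] + 1)
        if not native:
            count += 1
    return count


def rank_designs(target, library):
    """Return (boundary_count, starter_name, sources), fewest boundaries first."""
    designs = []
    for name in sorted(library):
        sources = design_from_starter(name, target, library)
        designs.append((count_non_native(sources), name, sources))
    designs.sort(key=lambda d: (d[0], d[1]))
    return designs


if __name__ == "__main__":
    toy_library = {"s1": "ADGJDD", "s2": "BKC"}  # hypothetical entries
    for score, starter, sources in rank_designs("ADGKDD", toy_library):
        print(score, starter, sources)
```

In the toy run, the starter that already shares five of six positions with the target ranks first, because only the two junctions around the swapped-in monomer are non-native.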

This illustrative program allows one to design and find PKS genes that encode PKS 
enzymes that are combinations of two or more different PKS enzymes with the fewest inter- 
modular boundaries, and optionally the fewest inter-protein boundaries. Many other alternative 
embodiments are provided by the present invention. 

For example, one can include the naturally occurring PKS that produces the target 
polyketide in the coded library to allow components of that PKS to be incorporated into the 
design of a new PKS. Also, one can include in the coded library non-naturally occurring PKS 
enzymes, such as those produced and published in the scientific and patent literature to make 
novel polyketides. See, e.g., PCT publication Nos. WO 98/49315 and WO 
96/40968, both of which are incorporated herein by reference. 

This CHUCKLES-coded polyketide library can be stored in a computer file as a set of 
records. In one embodiment, each record contains the chemical name of the polyketide, the 
unannotated CHUCKLES (containing basic macrocyclic monomers), the annotated 
CHUCKLES (containing basic macrocyclic monomers with information about post-PKS 
modifications), the producing organism(s), and other information (e.g., linearized representation 
of the polyketide structure, the accession number of organisms or plasmids that have been 
deposited, gene sequence information, and references). 
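
The record layout described above can be sketched as a small data structure. This is a minimal illustration only, not the MORPH source: the class name `LibraryRecord`, the function `parse_library`, the pipe-delimited one-record-per-line file format, and the convention that a leading "#" comments an entry out are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LibraryRecord:
    """One library entry; field names are hypothetical."""
    name: str                      # chemical name of the polyketide
    chuckles: str                  # unannotated CHUCKLES string
    annotated_chuckles: str = ""   # CHUCKLES with post-PKS modifications
    organisms: List[str] = field(default_factory=list)


def parse_library(text: str) -> List[LibraryRecord]:
    """Parse 'name | CHUCKLES | annotated | organisms' lines (assumed format)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        parts = [p.strip() for p in line.split("|")]
        parts += [""] * (4 - len(parts))  # pad missing trailing fields
        name, chuckles, annotated, organisms = parts[:4]
        records.append(LibraryRecord(
            name, chuckles, annotated,
            [o.strip() for o in organisms.split(",") if o.strip()],
        ))
    return records


sample = (
    "erythromycin | ADGJDD | A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl) | \n"
    "#epothilone | MEMLJDgE | | \n"
    "cladospolideA | ELLFN | ELLF(2-hydroxyl)N | Cladosporium fulvum, C. cladosporiodes\n"
)
records = parse_library(sample)
```

Reading only a selected subset of fields, as the description permits, would amount to dropping attributes from the record class.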

The MORPH program can read in the polyketide library entries to an array or list of data 
structures, where each entry data structure contains all or a selected subset of the fields in each 
library record. The MORPH program then reads in the CHUCKLES-coded TARGET 
polyketide from the user. This TARGET may optionally be blocked from the library so that it is 
not used as a STARTER or left in the library, i.e., if it is only distantly related to other known 
polyketides, or some modules could be useful in designing novel PKS genes, or it is desired to 
replace only certain PKS modules. This program could also be used for analoging at a particular 
position via wild-cards defined as part of the TARGET sequence by the user. 

Each member of the coded PKS library can be selected as a STARTER unit. Thus, 
during a run, all library members can be given an equal chance as STARTER units. After a 
STARTER is chosen, the TARGET is aligned with it. See Flowchart Figure 4D. Any method 
of alignment can be used such that the maximal number of adjacent STARTER modules is used 
in the final alignment. After the maximal adjacent modules are used in the ALIGNMENT, 
smaller adjacent sets or individual modules from the STARTER are used to fill in the gaps. 
There may be several alignments that are equally good based on the attempt to optimize the 
number of adjacent modules. For example, if the TARGET contains the "JDG" substring, then 
6-dEB, identified with the A1D2G3J4D5D6 CHUCKLES string, may align as 



TARGET     JDG 
6-dEB      J4D2G3 

TARGET     JDG 
6-dEB      J4D5G3. 



Both of these alignments have different maximal adjacent modules of the 
same length, two (D2G3 in the first and J4D5 in the second). Accordingly, either 
alignment could be used as a STARTER. 
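
The 6-dEB example can be reproduced with a short sketch. The description leaves the alignment method open, so the brute-force enumeration below is only one possible choice and is not taken from the MORPH source; the function names are hypothetical, and the sketch assumes for simplicity that every TARGET monomer occurs somewhere in the STARTER.

```python
from itertools import product
from typing import List, Tuple


def longest_adjacent_run(positions: Tuple[int, ...]) -> int:
    """Length of the longest run of consecutive STARTER module numbers."""
    best = run = 1
    for a, b in zip(positions, positions[1:]):
        run = run + 1 if b == a + 1 else 1
        best = max(best, run)
    return best


def best_alignments(target: str, starter: str) -> List[Tuple[int, ...]]:
    """All assignments of STARTER modules (1-based) to TARGET monomers
    that maximize the longest run of adjacent STARTER modules."""
    choices = [[i + 1 for i, m in enumerate(starter) if m == t] for t in target]
    if any(not c for c in choices):
        return []  # simplification: no gaps allowed in this sketch
    scored = [(longest_adjacent_run(p), p) for p in product(*choices)]
    top = max(score for score, _ in scored)
    return [p for score, p in scored if score == top]


def label(target: str, positions: Tuple[int, ...]) -> str:
    """Render an alignment in the numbered-monomer notation of the text."""
    return "".join(f"{m}{p}" for m, p in zip(target, positions))


alignments = best_alignments("JDG", "ADGJDD")
labels = [label("JDG", p) for p in alignments]
```

For the substring "JDG" against 6-dEB (A1D2G3J4D5D6), the two top-scoring assignments are J4D2G3 and J4D5G3, each with an adjacent run of length two; the third candidate, J4D6G3, has no adjacent modules and is discarded.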

With the optimized alignment from the STARTER, other library entries are 
systematically used to complete the alignment, or fill in the gaps. This part may be performed 
on either the optimized ALIGNMENT described above, or the ALIGNMENT without the single 
modules from the STARTER; the removal of the individual modules opens up more space into 
which larger pieces of the FILLER might be placed. The first library entry is designated as the 
FILLER. If the FILLER is the same as the STARTER, the next library entry is used as the 
FILLER. This library entry is flagged as the CURRENT_FILLER_LIBRARY_ENTRY. The 
same method for finding maximally adjacent modules and then smaller sets or single modules is 
used to fill the gaps in ALIGNMENT from the FILLER. If not all the gaps are filled in the 
ALIGNMENT, then the next library entry is used as a new source; that is, it is designated as the 
FILLER, and the gaps are filled further. This is repeated until the ALIGNMENT is complete or 
the end of the library is reached. 
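
The gap-filling pass can be sketched as follows. This is a simplified greedy illustration under stated assumptions, not the MORPH implementation: the names `fill_gaps` and `complete_alignment` are hypothetical, the library is a plain list of (name, CHUCKLES) pairs, only the first occurrence of a run in each FILLER is considered, and the example starts from an empty ALIGNMENT rather than from a STARTER alignment.

```python
from typing import List, Optional, Tuple

# A filled slot is (library entry name, 0-based module index); a gap is None.
Slot = Optional[Tuple[str, int]]


def fill_gaps(target: str, alignment: List[Slot],
              filler_name: str, filler_seq: str) -> List[Slot]:
    """Fill open slots from one FILLER, trying the longest adjacent runs first."""
    n = len(target)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            window = range(start, start + length)
            if any(alignment[i] is not None for i in window):
                continue  # at least one slot in this window is already filled
            pos = filler_seq.find(target[start:start + length])
            if pos >= 0:
                for offset, i in enumerate(window):
                    alignment[i] = (filler_name, pos + offset)
    return alignment


def complete_alignment(target: str, alignment: List[Slot],
                       library: List[Tuple[str, str]]) -> List[Slot]:
    """Use successive library entries as FILLER until no gaps remain."""
    for name, seq in library:
        if all(slot is not None for slot in alignment):
            break
        fill_gaps(target, alignment, name, seq)
    return alignment


library = [("aldgamycin", "BMLGJDL"), ("tedanolide", "JGEHMDHF")]
result = complete_alignment("MEMLJDGE", [None] * 8, library)
```

Passing an ALIGNMENT that already holds STARTER modules works the same way, since filled slots are never overwritten.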

Assuming all modules in the TARGET are represented in the library, the ALIGNMENT 
is eventually completely filled. The completed alignment is then written to an output file on the 
computer disk. When the ALIGNMENT is complete, or there are no more FILLERs in the 
library, the TARGET and STARTER alignment are re-copied to ALIGNMENT. The 
CURRENT_FILLER_LIBRARY_ENTRY is incremented, and a new attempt to fill in the gaps 
is started. 

When the CURRENT_FILLER_LIBRARY_ENTRY has reached the end of the library, 
the ALIGNMENT is wiped, and a new STARTER is chosen. The above process is then 
repeated for the next STARTER. When all library entries have been used as starters, then all 
feasible novel polyketide synthases have been generated and written to the computer file. The 
novel PKSs are then read back into memory and can be further evaluated. An illustrative 
evaluation process involves: 

(1) counting the non-native inter-module interfaces, and 

(2) counting the number of native inter-protein interfaces (for known and 
annotated gene sequences). 

The novel PKSs are then sorted based on these two numbers, giving higher priority to the non- 
native inter-module interfaces. In this mode, the goal is to identify those novel PKSs that contain 
the fewest non-native interfaces. 
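
The evaluation and sort can be sketched as below. This is an illustration under assumptions rather than the MORPH code: a design is modeled as a list of (source PKS, module index) pairs, an interface counts as native only when both modules come from the same PKS and are consecutive in it, and the inter-protein count is supplied by the caller and sorted in ascending order, which is a choice made for the example.

```python
from typing import List, NamedTuple, Tuple


class Design(NamedTuple):
    label: str
    modules: List[Tuple[str, int]]  # (source PKS name, module index in that PKS)
    inter_protein: int              # second evaluation count, precomputed here


def non_native_interfaces(modules: List[Tuple[str, int]]) -> int:
    """Count adjacent module pairs that are not neighbors in one source PKS."""
    return sum(
        1
        for (src_a, i), (src_b, j) in zip(modules, modules[1:])
        if src_a != src_b or j != i + 1
    )


def rank(designs: List[Design]) -> List[Design]:
    """Sort by non-native inter-module interfaces first, then the second count."""
    return sorted(
        designs,
        key=lambda d: (non_native_interfaces(d.modules), d.inter_protein),
    )


designs = [
    Design("a", [("ery", 0), ("ery", 1), ("ted", 1), ("ted", 2)], 2),
    Design("b", [("ery", 0), ("ald", 5), ("ted", 1), ("ted", 2)], 0),
    Design("c", [("ery", 0), ("ery", 1), ("ted", 1), ("ted", 2)], 1),
]
ranked_labels = [d.label for d in rank(designs)]
```

Because the key is a tuple, the inter-module count always dominates and the second number only breaks ties, matching the priority stated above.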

By providing methods and means for the computer-assisted analysis of polyketides and 
PKS genes, the present invention greatly facilitates the identification and production of new 
polyketides with useful activities. Those of skill in the art will appreciate that while the 
invention is in part illustrated in the Examples below with respect to the design of new PKS 
genes for known polyketides, the invention can also be used to design PKS genes for novel 
polyketides. In this embodiment, one simply provides the structure of the novel polyketide to 
the MORPH or other program of the invention to generate the desired PKS genes. 

Moreover, while the invention is exemplified below by designing new PKS genes 
composed of the coding sequences for one or more complete modules of two or more different 
PKS genes, partial modules can also be employed. With the appropriate choice of monomer sets 
and corresponding coding of the library to be searched, one can generate new PKS gene designs 
that take advantage of the potential to fuse one PKS gene coding sequence to another at a site 
corresponding to an intra-modular junction. In another embodiment, one can use "wild-cards" 
in the encoded polyketide or library to take advantage of known or predicted SAR. Thus, if one 
knows that a particular position in a polyketide can be varied (i.e., a hydrogen, methyl, or ethyl 
group at a location determined by an AT domain of a particular module, or a hydroxyl or keto 
group at a location determined by the presence or absence of a KR domain in a particular 
module) then one can use a wild-card monomer designation in the polyketide CHUCKLES 
string to generate PKS genes that produce each of the desired variants. 
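
One simple way to picture the wild-card designation is to expand it into every concrete CHUCKLES string it stands for. The sketch below is a hedged illustration: the function name and the example target are hypothetical, and the actual program may resolve wild-cards during alignment rather than by enumeration.

```python
from itertools import product
from typing import Dict, List


def expand_wildcards(target: str, wildcards: Dict[str, str]) -> List[str]:
    """Replace each wild-card symbol with every monomer in its set."""
    # Non-wild-card monomers map to themselves (a one-character pool).
    pools = [wildcards.get(symbol, symbol) for symbol in target]
    return ["".join(combo) for combo in product(*pools)]


# Hypothetical target: the third position may be monomer D or J.
variants = expand_wildcards("AGXD", {"X": "DJ"})
```

Each expanded string can then be treated as an ordinary TARGET, so the number of designs grows with the product of the wild-card set sizes.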

The methods of the invention have diverse application in addition to the design of new 
PKS genes. As but one illustrative example, the methods of the invention can be used to design 
methods to produce a desired compound. Organic molecules containing stereochemical centers 
are useful for a number of purposes, including use as synthetic or semi-synthetic intermediates. 
The preparation of such intermediates by organic synthesis can be extremely time consuming 
and expensive. An alternative source of such intermediates is via specific degradation of a 
polyketide, and the present invention provides computer-assisted means for designing such 
production methods. 

Thus, certain functional groups of polyketides are susceptible to bond cleavage by 
specific chemical reactions that do not affect other functional groups. For example, carbon- 
carbon double bonds can be specifically cleaved by permanganate without affecting other 
functional groups normally in polyketides, such as ketones, alcohols, and lactones. Likewise, 
the Baeyer-Villiger reaction converts a ketone to an ester (lactone) without affecting other 
groups of the aglycone. In accordance with the methods of the invention, one can assemble a 
library of polyketides in a database that can be addressed with a query describing a particular 
chemical reaction to generate all of the degradation products produced by that reaction upon 
each of the polyketides in the library. The degradation fragments thus generated serve as a 



WO 01/92991 



19 



PCT/US01/17352 



library of the invention that can be sorted by properties, such as size, number and type of 
stereochemical centers, functional groups, or other factors, and searched for useful compounds. 
Moreover, the functional groups on the ends of the fragments generated (or at other locations) 
can also be converted to other functional groups by chemical reactions (optionally employing 
protecting groups on other functional groups), and the database of compounds can be expanded 
to include the compounds produced by such reactions. 

From even a modest library of approximately 200 compounds, one can in this manner generate, using 
the methods of the invention, two to three times as many valuable chemical intermediates. Once 
such an intermediate is identified, the organism that produces the polyketide from which the 
fragment is derived is fermented, the polyketide isolated in bulk, the chemical reaction 
performed, and the desired degradation product(s) isolated and used. In this manner, the present 
invention makes available a wide variety of useful products otherwise unattainable. 

Thus, the present invention has wide application in the fields of chemistry, particularly 
medicinal chemistry, molecular biology, and medicine. Those of skill in the art will recognize 
these and other benefits and applications provided by the present invention. Thus, the following 
examples are given for the purpose of illustrating the present invention and shall not be 
construed as being a limitation on the scope of the invention or claims. 

EXAMPLE 1 

The MORPH Program 

This example provides the source code for an illustrative MORPH program of the 
invention. The MORPH program is a command line driven program that runs on a UNIX 
system. The program can be run from a shell script, such that the user fills in the entire 
command ahead of time, then post-processes the output file with UNIX utilities including sort, 
egrep, and uniq. 

The command line appears as follows: 

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]. 

The library file is the name of the text file described below in Example 2. The target 
name is a user-defined identifier to distinguish this target from the library members (e.g., 
epothiloneD). The target sequence is a string of monomers that represent the CHUCKLES- 
encoded target polyketide (e.g., MEMLJDGE). Generally, if the target sequence is in the 
library, it is commented out from the library so that the morph program does not find the target 
itself. The three different wildcards, X, Y, and Z, are independent sets of monomers that can be 
included in the target sequence. 

The output from the morph program can be redirected to a file. This output file is then 
post-processed by (1) extracting the HIT lines with valid combinations of modules that yield the 
target, (2) sorting the HITs based on alphanumeric content using the UNIX sort command, (3) 
running the UNIX uniq command, which removes multiple copies of each HIT, leaving one copy 
of each, and (4) sorting based on the number of pieces in the sequence of modules. Generally, the 
fewest number of pieces, which correspond to the fewest number of inter-modular interfaces, are 
desired; these will appear at the top of the output. 
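
The four post-processing steps can also be expressed in a few lines of Python. The layout of MORPH output lines is not given in this description, so the sketch assumes a hypothetical whitespace-delimited format in which one field holds the integer number of pieces; the function name and the `count_field` parameter are likewise assumptions.

```python
from typing import Iterable, List


def postprocess(lines: Iterable[str], count_field: int = 1) -> List[str]:
    """Keep HIT lines, remove duplicates, and order by number of pieces."""
    # Steps (1)-(3): extract HIT lines, sort alphanumerically, drop duplicates.
    hits = sorted({line.rstrip("\n") for line in lines if "HIT" in line})
    # Step (4): stable sort on the piece count, so ties stay alphabetical.
    return sorted(hits, key=lambda line: int(line.split()[count_field]))


# Hypothetical output lines: "HIT <pieces> <design>"
raw = ["HIT 3 b\n", "MISS 1 x\n", "HIT 2 a\n", "HIT 3 b\n", "HIT 2 c\n"]
ordered = postprocess(raw)
```

The designs with the fewest pieces come first, mirroring the ordering produced by the shell pipeline.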

Below are some illustrative examples of calls to the MORPH program from a shell script 
using epothilone as a target. The first example generates combinations that yield epothilone D: 

%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD 

%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort 

The second example generates combinations that yield a derivative of epothilone D having a 
hydroxyl at C-13: 

%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH 
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort 

The third example generates an epothilone having wildcards (set 1): 

%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1 

%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort 

The fourth example generates an epothilone having another set of wildcards (set 2): 

%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2 
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort 

MORPH in its current implementation operates at the monomer level and thus does not 
handle intra-modular modifications/splitting. Future implementations could convert the 
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform 
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double 
bonds are present in the library, but are ignored by the program. These bonds can be introduced 
post-biosynthetically and the exact source is generally unknown. 

The source code for MORPH is found in Appendix A (version 3.0) and B (version 4.0) 
(deposited in the microfiche appendix). 



EXAMPLE 2 

Illustrative Polyketide Library 

This example provides the contents of an illustrative CHUCKLES-encoded polyketide 
library. The first column provides the name of the polyketide; the second the CHUCKLES 
string; the third the annotated CHUCKLES string; and the fourth the source organism. Entries 
under annotated CHUCKLES and source organism are not complete for all of the polyketides. 

POLYKETIDE | CHUCKLES | annotated-CHUCKLES | SOURCE ORGANISM 

3-acetyl-4"-butyltylosin | FMNGODF | | 
#aculeximycin | RRNRSMRSRSSSSRLSSN | RR(2-ethyl)NRSMRS(1-glycosyl)RSSS(2-hydroxyl)SR(1-glycosyl)LSSN(2-ethyl) | 
albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077 | BLME=JN | BLM(1,2-epoxy)E(1-methoxy)=J(2-hydroxyl)N | 
albocycline-M2 | BLME=JN | B(2-hydroxyl)LME(1-methoxy)=J(2-hydroxyl)N | 
albocycline-M3 | BSME=JL | BSME(1-methoxy)=J(2-hydroxyl)L | 
albocycline-M5 | BSME=JN | BSME(1-methoxy)=J(2-hydroxyl)N | 
albocycline-M6 | BLME=JL | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)L | 
albocycline-M7 | BLME=JN | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)N | 
albocycline-M8 | BLME=Q | BLME(1-methoxy)=Q | 
aldgamycin | BMLGJDL | B(1-cyc)MLG(2-hydroxyl)JDL(2-cyc) | 
amphotericinA | CDNNLNNNNFCEFEALEE | CDNNLNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE | 
#amphotericinB | CDNNNNNNNFCEFEALEE | CDNNNNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE | 
angiolam | NMFJNSJIQLGAm | | 
aplyronineA | BFJCENFFMEKAFNN | B(1-C(=O)C)F(1-C(=O)C(C)N(C)C)JCENFFME(1-methoxy)KAF(1-C(=O)C(N(C)C)COC)NN | sea hare Aplysia kurodai 
apoptolidin | EFLMNAMMM | EF(methoxy-1,hydroxy-2)LMNAMMM | 
aurachinB | MLMLM | MLMLM | 
aurachinC | MLMLM | MLMLM | 
A59770 | QQKQQLJFCDN | QQK(2-ethyl)Q=QLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | Amycolatopsis orientalis 
A82548A-cytovaricin | QQQKQNLJFDDN | | 
A83543A | PLFQQQ | FLFQQQ | Saccharopolyspora spinosa 
AB023a | NNNNNRSSSLSR | NNNNNRSSSLSR | 
AH-758 | RSNMURMN | RS(2-methoxy)NMURMN(2-methoxy) | 
bafilomcinD | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) | S. sp. 
bafilomycinA1 | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) | 
borrelidin | SNMRUUUS | SNM(2-N)RUUUS | 
calyculin | OFBBMmNm | QF(1-methoxy)BBMmNm | Discodermia calyx 
candicidin-candeptin-ascosin-levorin-etc | CDNNNNNNNFCEELIEE | CDNNNNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EE(1-cyc)E(2-hydroxyl)LEEE | S. griseus, S. canescus, S. levoris, S. viridoflavus Stv. grisoviridum 
candidin | QDNNNNNNNFCEFELEEE | [illegible]cyc,2-carboxylicacid)EF(1-cyc)E(2-hydroxyl)LEEE | 
[illegible] | | [illegible]methoxy)F(1-C(=O)C) | 
carbomycinB-magnamycinB | FNNGOFF | FNNGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | 
[illegible]-magnamycin-deltamycinA4-NSC51001-PS97-3628-WC3628 | [illegible] | | 
[illegible]-myconomycin-aldgamycinDmiko | [illegible] | [illegible]hydroxyl)KD(1-glycosyl)N | [illegible] albogriseolus 
chimeramycinB-PTL448 | BMNCODF | B(O-ethyl)MNC(1-glycosyl)OD(1- | S. ambofaciens ka-448 
chivosazolA | SRRnNNSRQNnRMnNn | SRR(1-O-macrocyc)nNNSR(1-[illegible]glycosyl)MnNnQ(keto-macrocyc) | S. cellulosum 
cineromycinB | BLME=JN | [illegible] | S. cinereochromogenes, S. sp. 
cineromycinBdehyro | BLMI=JN | BLMI=J(2-hydroxyl)N | S. griseoviridis 
cineromycinB[illegible]hyro | BLME=JL | BLME=J(2-hydroxyl)L | S. griseoviridis 
cirramycinB1dihydroxy-A6888X | BMNGODF | BM(1,2-epoxy)NGOD(1-glycosyl)F | S. flocculus 
cirramycinB-cirramycinB1-Acumycin-A688A-B58941-A6888A | AMNGODF | AM(1,2-epoxy)NGOD(1-glycosyl)F | S. cirratius, S. griseoflavus, S. fradiae, S. flocculus 
cladospolideA | ELLFN | ELLF(2-hydroxyl)N | Cladosporium fulvum, C. cladosporiodes 
cladospolideB | ELLFn | ELLF(2-hydroxyl)n | Cladosporium fulvum, C. cladosporiodes 
cladospolideC | ELLEN | ELLE(2-hydroxyl)N | fungus Cladosporium tenuissimum 
concanamycinA-folimycin-A661-1-S45A-TAN1323B-X4357B | CENMKCCMN | CE(2-methoxy)NMKC(2-ethyl)CMN(2-methoxy) | S. diastatochromogenes, S. sp, S. neyagawaensis 
concanamycinB-S45B | CENMKCCMN | CE(2-methoxy)NMKCCMN(2-methoxy) | S. diastatochromogenes 
concanamycinG-anhydroconcanamycinB | NBNHCENMKCCMN | NBNHCE(2-methoxy)NMKCCMN(2-methoxy) | 
conglobatin | AJM | A(1-oxazolyl)JM AJM | 
#copiamycin | RNRSSSSSSRLRSRN | RNRSSSS(1-O-cyc)S(2-hydroxyl)S(1-cyc)RLRSRN | S. hygroscopicus 
cytovaricin-H230 | QQQKQNLJFCDN | QQQKQNLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. sp., S. [illegible] 
cytovaricinB | QQKQNLJFDDN | QQQKQNLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. [illegible] 
CP64537 | ARGKDD | A(2-glycosyl)R(1-hydroxyl)KDD(1-glycosyl) | Streptomyces hunricola ATCC 39491 
damavaricinC | QDQCCNM | QDQCCNM | S. spectabilis 
deltamycinA1 | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae 
deltamycinX-desisovalerylcarbomycinA | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae 
engleromycin | QNJHN | QNJH(1-hydroxyl)N(1,2-epoxy) | Engleromyces goetzei 
#epothilone | MEMLJDgE | MEM(1,2-epoxy)LJDG(2-methyl)E | 
erythromycin | ADGJDD | A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl) | 
espinomycinA2 | ENNCOFF | ENNCOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. fungicidicus 
filipinIII-lagosin14deoxy | ENNNNMEFFFFFFFL | E(2-hydroxyl)NNNNMEFFFFFFF | S. filipinensis, S. durhamensis 
filipin-lagosin | ENNNNMEFFFFFFF | | S. filipinensis, S. durhamensis 
formamicin | CBNMOCMN | CBNMO(includes a long, branched alkyl chain)CMN | 
foromacidinB-spiramycinII-spiramycinB | ENNAOFF | | S. ambofaciens 
FD891 | RRUSNNLNRMM | RRUS(1-O-macrocyc)NNL(2-hydroxyl)N(1,2-epoxy)RMMQ(keto-macrocyc) | S. graminofaciens 
FK895 | RNRNMRNRLS | R(1-methoxy)N(1,2-epoxy)RNMR(1-O-macrocyc)NR(1-C(=O)C,2-hydroxyl)LSQ(keto-macrocyc) | S. hygroscopicus 
FK-506 | MAEPMJ3BKOO | | 
gedamycin | IEJBNNNnNNNFAEFEEEEBE | IEJBNNNnNNNF(1-glycosyl)A(1-O-cyc,2-carboxylicacid)EF(2-cyc)EEEEIE | S. aureofaciens 
geldanamycin | QULRMRnM | QUL(2-methoxy)RMS(1-CONH2,2-methoxy)nM | S. hygroscopicus var. gelanus 
gephyronicacid | RRTSRRM | | 
gloeosporone | ELLEIL | ELLE(1-O-cyc)I(2-cyc,2-hydroxyl)L | Colletotrichum gloeosporioides f. sp. jussiaea 
GERI-155 | BNLGJDN | B(2-O-glycosyl)N(1,2-epoxy)LG(2-hydroxyl)JD(1-glycosyl)N | S. GERI-155 
halomicin | QNCRCBNM | QNC(1-methoxy)R(1-C(=O)C)CBNM | Mic. halophytica 
herbimycin | ALDMFnM | A(1-methoxy)L(2-methoxy)D(1-methoxy)MF(1-CONH2,2-methoxy)nM | 
hygrolidin | CENMJCMM | [illegible]methoxy)NMJCMMQ(keto-macrocyc) | S. hygroscopicus 
hygrolidin-oxo | BNHCENMJCMN | [illegible]methoxy)NMJCMN(2-methoxy)Q(keto-macrocyc) | S. hygroscopicus 
[illegible] | URRNST MOOS | [illegible]glycosyl)LMR(1-O-cyc)=QS(2-cyc)Q(keto-macrocyc) | 
juvenimicinA1-T1124A1-M4365A1 | BMNGFDF | [illegible] | Mic. chalcea 
juvenimicinA2-T1124A2-M4365A2 | BMNGJDF | | Mic. chalcea 
juvenimicinA4-T1124A4-M4365A4 | BMNGODF | | Mic. chalcea-Mic. catrillata 
juvenimicin-T1124 | BNNGJDF | | Mic. chalcea 
kanchanamycin | NMSSSSSSSRLRSRNN | | 
lankamycin-kujimycin-landavamycin-A20338N2 | ADGJDD | | S. violaceoniger, S. spinichromogenes 
leinamycin | QNNIMLN | | 
leucanicidin | CENMKCMN | CE(2-methoxy)NMKCMN(2-methoxy) | S. halstedii 
leucomycinA12-kitasomycinA12 | FNNCODF | | S. kitasatoensis 
leucomycinA14-kitasomycinA14 | FnNCOFF | | S. kitasatoensis 
leucomycinA3-josamycin-platenomycinA3-turimycinA5 | ENNAODF | | S. kitasatoensis, S. hydroscopicius, S. narbonensis, S. platensis 
leucomycinA5-turimcinH4 | ENNCOFF | | S. kitasatoensis 
lienomycin | SSSNSSSTSMLNRRNNNNNL | | 
lucensomycin-etruscomycin-lucimycin-FJ1163-butylpimaricin | SNNNNSREEFNN | | Act. sp, S. lucensis, S. glaucus 
L155175 | RSNMURmM | RS(2-methoxy)NMURmM | 
L681110 | NMURMN | NMURMN(2-methoxy) | 
macrocin-lactenocin | BMNHODF | | S. aureus, S. lutea, K. pneum, B. subtl., Shva. 
macrocin-Y07625 | QMNGODF | | S. fradiae gs 16 
maridomycinI | ENNCOFF | | S. paltensis, S. rimosius, S. capuensis, S. racemochromogenes 
maridomycin-platenomycinC3-turimycinEPS-B5050A-YL704-C-3 | ENNCJDF | | S. hygroscopicus-S. platensis-malvinus 
mathemycinA | RSSRSSRSSRRMRMKNLRL | | 
midecamycinA1-platenomycinB1-SF837 | ENNCOFF | ENNCOF(2-methoxy)F | S. mycarofaciens 
midecamycinA2-mydecamycinA2-SF837A2 | FNNDJEE | FNNDJE(2-hydroxy)E(1-C(=O)CC) | S. mycarofaciens 
milbemycin | OOOMKNOO | | 
#monazomycin | SSRRSSURNSRSMRMRNLRSLL | | 
mycinamicin | BNNGJDN | B(1-cyc)NNGJDN(2-cyc) | 
mycinamicinVI | BNNGJDN | | Micromonospora griseorubida sp. 
mycinamicinXII | BNNGLDN | | S. aureus, S. pyogenes, Corynebacterium 
#mycolactone | SRMUSMSSL.sSMNMMN | SRMUSMSSL.sSMNMMN | 
mycolactoneA | SRMUSMSSL | SRMUSMSSL | 
mycolactoneB | sSMNMMN | sSMNMMN | 
myxovirescinA1 | QQEFLNNLLILKJ | | 
myxovirescinA2 | QQEFLNNLLEJJ | | 
myxovirescinB-megovalicinB | QQEFLNNLLILKM | | 
myxovirescinC-C1 | QQEFLNNLLLLKJ | | 
myxovirescinD | QQEFLNNLLLLKM | | 
myxovirescinE | QQEFLNNLLEJ | | 
myxovirescinF1 | QQEFLNNLLLLKJ | | 
myxovirescinF2 | QQEFLNNLLLLJJ | | 
myxovirescinG1 | QQEFLNNLLLKJ | | 
myxovirescinG2 | QQEFLNNLLLJJ | | 
myxovirescinH1 | QQEFLNNLLILKJ | | 
myxovirescinH2 | QQEFLNNLULJJ | | 
myxovirescinL | QQEFLNNLLILHJ | | 
myxovirescinP1 | QQEFLNNLLILLKJ | | 
myxovirescinP2 | QQEFLNNLLILLJJ | | 
myxovirescinQ | QQEFLNNLLILKM | | 
myxovirescinS | QQEFLNNLLILHJ | | 
M4365G2 | BMNGODF | | Streptoverticillum [illegible], S. thermaotolerans 
nancimycin | ONDACBNM | | 
neocopiamycin | NRSSSSSRLRSRN | | 
niddamycin-F3463-3desacetylcarbomycinB | FNNGOFF | | B. subt 
oligomycinA | QJNNJRGAGAN | | S. diastatochromogenes, S. chibaensis 
oligomycinB | QJNNKCHDHDN | | S. diastatochromogenes 
oligomycinB-44homo | QJNNJRGRGAN | | 
oligomycinD | QJNNJBGAGAN | | S. arabicus, S. parvulus, S. rutgersensis, S. griseus, S. aureofaciens 
ossamycin | QQQNLLKFDDN | | 
perimycin | JBNNNnNNNFCEFEEEEIB | JBNNNnNNNFCEFEEEEIE | 
phenalamid | nCNNNNM | | 
phenalamideA1-fenalamid-M02-C | JMCMNnNM | | 
phenalamideA2-102-T | JMCMNNNM | | 
phenalamideA3 | JMCMNNnM | | 
phenalamideB | JMCMNnNM | | 
phenalamideC | JMCMNNNM | | 
phthoramycin | OONLUSRRN | | 
pikromycin | ANGJDH | | 
[illegible] | | | ma 5285; S. [illegible] 
[illegible] | [illegible] | | 
PF1163 | EJLLQ | | 
[illegible] | [illegible]SNNSSSUSRRQ[illegible] | | 
rapamycin | FGMEGJNNME | | 
[illegible] | [illegible] | | 
rifamycin | QQQNDACBNM | | 
[illegible] | [illegible] | | 
rosamicin-repromicin | [illegible] | | 
rosaramycin | [illegible] | | [illegible] rosaria 
[illegible] | [illegible] | | 
[illegible] | [illegible] | | 
scytophycin | BFCEENOEMN | | 
scytophycinB-E | BFCEONOEMN | | 
shurimycin | [illegible] | | 
sorangicin | [illegible]DFNQNNnn | [illegible]hydroxyl)[illegible] | 
sorangicinA | ONLNF | | 
sorangicinH | [illegible] | | 
[illegible] | [illegible] | | [illegible] cellulosum 
soraphen | EUFJDFA | | 
spiramycin | nNCJDF | | 
staphcoccomycin-angolamycin-shincomycin | CMNGODF | | 
stipiamide | JMCMNnNM | | 
tartrolonB2 | nNLFHE | nNLFHE | fragment 
tedanolide | JGEHMDHF | JGEHMDHF(2-hydroxyl) | 
thiazinotrienomycin | QLmRSNNNS | | 
tiacumicin | SMMRMSNM | | 
tylosinC-macrocin | EMNHODF | | 
tylosin-A | EMNGODF | | 
TAN-1323 | NMURRMM | | 
TMC-34 | NRSSSSSSRLRSRN | | 
venturicidinA | CKNFLMQQO | | 
vicenistatin | LNMULRNN | | 
viranamycinB-virustomycin-TAN1323C | QNMKCCMN | | S. sp. cb41 
zincophorin-griseochelin | KMCNLACC | | S. griseus 



In another embodiment, the polyketide library includes the name of the polyketide, the 
CHUCKLES string and a linearized representation of the structure. The linearized 
representations of the CHUCKLES structures for erythromycin and epothilone are as follows: 



epothilone D     MEMLJDgE 

erythromycin     ADGJDD 



An illustrative example of a polyketide library containing linearized representations of their 
structures is found in Appendix C (deposited in the microfiche appendix). 

EXAMPLE 3 

Alternative PKS Genes for Epothilone 

This example illustrates the alignment and design of novel PKS genes for the target 
epothilone. Epothilone is first converted into CHUCKLES string format and then read into the 
MORPH program as a TARGET. The program then generates all possible alignments of library 
modules and sorts the alignments to determine preferred combinations of modules for gene 
construction and production of epothilone via a novel polyketide synthase gene. 

The epothilone D structure above was first opened at the macrolactone ring closure 
between the C-1 ketone and the C-15 oxygen. The monomer set shown in Figure 2 was then 
matched against each of the successive pairs of macrocyclic backbone carbon atoms, starting 
with C-2 and C-3, which match monomer E. The next two carbon atoms, C-4 and C-5, match 
monomer G with an additional post-synthetic methylation on C-4. C-6 and C-7 match monomer 
D. C-8 and C-9 match monomer J. C-10 and C-11 match monomer L. C-12 and C-13 match 
monomer M. C-14 and C-15, where C-15 has a hydroxyl substitution (modified by thioesterase 
to close the macrocycle), match monomer E. C-16 and C-17 match monomer M. 

The rest of the molecule, a methyl-substituted thiazole moiety, does not match any of the 
monomers in the monomer set. This moiety corresponds to a malonyl CoA loading module and 
an NRPS module that together generate the methyl-substituted thiazole moiety. This moiety is 
thus omitted from the CHUCKLES string generated from this illustrative monomer set but can 
be added simply by adding a monomer to the set. The CHUCKLES string generated is 
EGDJLMEM, which is in the reverse order of biosynthesis. This sequence is then reversed to 
MEMLJDGE to yield a monomer sequence that matches the order of biosynthesis. The 
sequence is then annotated to account for the post-synthetic modifications as follows: 
MEMLJDG(2-methyl)E. 
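
The reversal step, and the relationship between an annotated string and its base CHUCKLES string, can be illustrated with two small helpers. Both function names are hypothetical and the code is a sketch, assuming that annotations are always enclosed in parentheses (possibly nested) directly after the monomer they modify.

```python
def to_biosynthetic_order(chuckles: str) -> str:
    """Reverse a CHUCKLES string read from C-2 onward into biosynthetic order."""
    return chuckles[::-1]


def strip_annotations(annotated: str) -> str:
    """Drop parenthesized post-synthetic annotations, keeping the monomers."""
    out = []
    depth = 0
    for ch in annotated:
        if ch == "(":
            depth += 1          # entering an annotation (may be nested)
        elif ch == ")":
            depth -= 1          # leaving an annotation
        elif depth == 0:
            out.append(ch)      # monomer symbol outside any annotation
    return "".join(out)


forward = to_biosynthetic_order("EGDJLMEM")
base = strip_annotations("MEMLJDG(2-methyl)E")
```

Tracking the nesting depth, rather than using a simple pattern match, keeps the helper correct for annotations that themselves contain parentheses, such as acyl groups written in SMILES-like form.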

This target sequence is provided to the MORPH program to generate all possible 
combinations of modules in the CHUCKLES-encoded library that will yield the target 
CHUCKLES. The valid combinations are then sorted in increasing order of non-native inter- 
module interfaces. In one implementation, a MORPH run generated 3,452 valid sequences of 
five inter-module interfaces. Of these, none contain fewer than five inter-module interfaces. 
Some illustrative sample module combinations appear below. The combinations are shown 
listing each monomer followed by a colon and the name of the polyketide(s) from which it is 
derived, followed by a parenthetical showing the associated monomers in that polyketide. 
Vertical lines represent modular junctions between two different polyketides. 

Illustrative PKS Gene 1 : 

M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML) 
L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE) 
E:tedanolide(GEH) 

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in 
the order listed, the module from the acetyl-4"-butyryltylosin PKS that corresponds to monomer 
M, the module from the tedanolide PKS corresponding to monomer E, the modules from the 
aldgamycin PKS corresponding to monomers M, L, J, and D, and the modules from the 
tedanolide PKS corresponding to monomers G and E. 

Illustrative PKS Gene 2: 

M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME) 
E:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(MEJ) | 
M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME) | 
L:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(BLM) | 
J:erythromycin(GJD) D:erythromycin(JDD) | 
G:tedanolide(JGE) E:tedanolide(GEH) 

EXAMPLE 4 

Alternative PKS Genes for 6-Deoxyerythronolide B

This example illustrates the alignment and design of novel PKS genes for the 
erythromycin basic polyketide structure (6-dEB) using the MORPH program. 







For the 6-dEB structure above, the CHUCKLES string is generated by first opening the
macrolactone ring closure between the C-1 ketone and the C-13 oxygen. Using the monomer
set and matching protocol described in Example 3, one generates the CHUCKLES string
DDJGDA, in the reverse order of biosynthesis. This sequence is then reversed to ADGJDD to
yield the monomer sequence that matches the order of biosynthesis. The sequence is then
annotated to account for the post-synthetic modifications (erythromycin A) as follows:
A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl).

This target sequence is supplied to the MORPH program to generate all possible
combinations of modules in the CHUCKLES-encoded library. The valid combinations are then
sorted by increasing number of non-native inter-module interfaces. In one implementation, a
MORPH run generated 19,631 valid sequences with five or fewer inter-module
interfaces. Of these, 13,306 contain 4 inter-module interfaces, and 256 contain only 3
inter-module interfaces. Some of these contain only two inter-module interfaces, and one
contains only one. Some illustrative sample module combinations follow.
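The ranking by number of interfaces can be illustrated with a short C sketch. It is not taken from the MORPH listing: the Candidate type and the functions count_junctions, by_junctions, and rank_candidates are hypothetical names, and the sketch simply assumes that each vertical bar in a combination written in the notation of these examples marks one junction.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one candidate module combination. */
typedef struct {
    const char *combination; /* text in "monomer:polyketide(context)" notation */
    int njunctions;          /* number of '|' separators found in the text */
} Candidate;

/* Count the vertical bars in a combination string. */
static int count_junctions(const char *combination)
{
    int n = 0;
    const char *p;

    assert(combination != NULL);
    for (p = combination; *p != '\0'; p++) {
        if (*p == '|') n++;
    }
    return n;
}

/* qsort comparator: fewer junctions sort first. */
static int by_junctions(const void *a, const void *b)
{
    const Candidate *ca = (const Candidate *)a;
    const Candidate *cb = (const Candidate *)b;

    return (ca->njunctions > cb->njunctions) - (ca->njunctions < cb->njunctions);
}

/* Fill in the junction counts, then sort in increasing order. */
static void rank_candidates(Candidate *c, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        c[i].njunctions = count_junctions(c[i].combination);
    }
    qsort(c, n, sizeof c[0], by_junctions);
}
```

In the usage examples of the source listing the same ordering is obtained by piping the HIT lines through sort on the count column; the sketch only makes the comparison explicit.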

Illustrative PKS Gene 1 : 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:mycinamicin(NGJ) J:mycinamicin(GJD)
D:mycinamicin(JDN) | D:amphotericinA(CDN) 

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in
the order listed, the amphotericin PKS module corresponding to monomer A, the aldgamycin
PKS module corresponding to monomer D, the mycinamicin PKS modules corresponding to
monomers G, J, and D, and the amphotericin PKS module corresponding to monomer D.

Illustrative PKS Gene 2:

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:pikromycin(NGJ) J:pikromycin(GJD)
D:pikromycin(JDH) | D:aldgamycin(JDL) 

Illustrative PKS Gene 3:

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD)
D:lankamycin-kujimycin-landavamycin-A20338N2(ADG)
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ)
J:lankamycin-kujimycin-landavamycin-A20338N2(GJD) | D:ossamycin(FDD)
D:ossamycin(DDN)

Illustrative PKS Gene 4: 

A:amphotericinA(EAL) | D:lankamycin-kujimycin-landavamycin-A20338N2(ADG)
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ)
J:lankamycin-kujimycin-landavamycin-A20338N2(GJD) | D:A82548A-cytovaricin(FDD)
D:A82548A-cytovaricin(DDN)

Illustrative PKS Gene 5: 

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD)
D:lankamycin-kujimycin-landavamycin-A20338N2(ADG)
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ)
J:lankamycin-kujimycin-landavamycin-A20338N2(GJD)
D:lankamycin-kujimycin-landavamycin-A20338N2(JDD)
D:lankamycin-kujimycin-landavamycin-A20338N2(DD-)

EXAMPLE 5

Source Code: 
#include <stdio.h> 

/* ~siani/programs/morph/morph3.c

PURPOSE: To traverse recursively all the entries in PKS.lib,
generating all feasible combinations of PKS modules to make the TARGET (e.g., epothilone).
INPUT: libraryfile: tab-delimited file of CHUCKLES-coded polyketides with the
following columns:

1. polyketide name

2. plain CHUCKLES

3. annotated CHUCKLES (contains information about post-synthetic
modifications)

4. source organism;
targetname: user-defined name (e.g., epoD);

targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g.,
MEMLJDGE);

X, Y, Z sets of wildcards: sets of monomers for particular positions 
appearing in target sequence (the wildcards can be used for analoging the TARGET polyketide); 
hard-coded parameters which may be reset (requires recompiling): 

NBOUNDARY_CUTOFF determines the maximum number of 
non-native inter-modular interfaces which are contained in the output (set to 5, but may be 
increased or decreased); and 

RECURSION_COUNTER_CUTOFF specifies the number of 
levels of recursion (defaults to 0, 1, 2) acceptable for the run - a large PKS library can cause 
recursion that will greatly increase run time; because of the multi-directionality of the 
alignments (using every library entry as a STARTER), there is typically no need to go beyond 2 
levels of recursion. 
OUTPUT:

All combinations of modules that meet parameters set by user. Example
output from MEMLJDGE (epothilone D) using a subset of a PKS library is provided below.
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of
"pieces" that are needed to put together the PKS.

Names of PKSs have been abbreviated to fit them in these
comments.

HIT M:3atyl(PMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)|
J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5
HIT M:albMl(LME) E:albMl(MEJ)| M:albMl(LME)| L:aldga(MLG)|
J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)|
J:aldga(GJD) D:aldga(JDL)| G:3atyl(NGO)| E:albMl(MEJ)| 5
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)|
J:aldga(GJD) D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)|
J:aldga(GJD) D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5
USAGE:

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards]
[-z Z-wildcards]
examples:

# generate combinations that yield epothilone D
%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD
%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort
%egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTER_ALIGN

# generate combinations that yield epothilone D with a C13-hydroxyl
%morph3 -l PKS.lib -n epoD-130H -t MEXLJDGE -x ABCD > oepoD-130H
%egrep HIT oepoD-130H | sort | uniq | sort +10 -11 > oepoD-130H.uniq.sort
%egrep ALIGN_TARGET oepoD-130H > oepoD-130H_STARTER_ALIGN

# generate combinations that yield epothilone with the following wildcards (set 1)
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

# generate combinations that yield epothilone with the following wildcards (set 2)
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort
LIMITATIONS:

This version does not handle intra-modular modifications/splitting because
morph is operating at the monomer level. Modifications could convert the CHUCKLES-encoded
strings into the corresponding and equivalent SMILES and then perform more complex
chemical analysis of the PKS molecular graphs.
Currently, inter-modular double bonds are present in the library, but are ignored by the
morph program.

*/

#include <stdio.h>
#include <stdlib.h> /* malloc, exit */
#include <string.h> /* strlen, strcpy */

/* ~siani/programs/morph/morph3.c
*/

#define TRUE 1
#define FALSE 0
#define DEBUG_MATCH FALSE
#define DEBUG_STARTER FALSE
#define DEBUG_ALIGN FALSE
#define DEBUG_RECURSE FALSE
#define DEBUG_WILDCARD FALSE
#define MAXLEN 80
#define MAX_TYL_LEN 6
#define MAX_EPO_LEN 6
#define MAXNAMELEN 160
#define MAX_LIB_ENTRIES 500
#define MAXWILD 3
#define MAXBUF 200
#define NBOUNDARY_CUTOFF 5
#define RECURSION_COUNTER_CUTOFF 2
#define STARTER_MINIMUM_ADJACENT_ALIGN 2
#define MINIMUM_ADJACENT_ALIGN 2
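typedef struct _lib {
    char name[MAXNAMELEN];
    char monomersequence[MAXNAMELEN];
    char annotatedsequence[MAXNAMELEN];
    char alignedsequence[MAXNAMELEN];
    char alignedPKSname[MAXLEN][MAXNAMELEN];
    int boundarytoright[MAXNAMELEN];
    int marked[MAXLEN];
    char context[MAXLEN][4];
    int recursion_tagged;
    int nboundary;
} LIB;

/* prototypes for the functions defined below */
int recurse_through_the_library(int nfilledmax, LIB epotemp, LIB *epothilone,
    int nwildcard, char wildcards[MAXWILD][MAXLEN], int nlib, LIB *library,
    int *recursion_counter);
int maximal_adjacent_alignment(LIB *epothilone, int nwildcard,
    char wildcards[MAXWILD][MAXLEN], LIB *library, int ilib,
    int smallest_acceptable_piece);
int maximal_adjacent_alignment_and_dump(LIB *epothilone, int nwildcard,
    char wildcards[MAXWILD][MAXLEN], LIB *library, int ilib,
    int smallest_acceptable_piece);
int output_fresh_alignment(LIB *epotemp);
int get_library(char *libraryfile, LIB *library);
int dump_STARTER_align(LIB epothilone, int nwildcard,
    char wildcards[MAXWILD][MAXLEN]);
int reset_epotemp(LIB *epotemp, LIB epothilone);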

int main(int argc, char **argv)
{
    int ii=0,jj=0,kk=0,ll=0;
    int nlib=0;
    int ecount=0;
    int nfilled=0,nfilledmax=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int mmpass=0;
    int lcount=0;
    int new_unmarked_entries_filled=0;
    int recursion_counter = 0;
    int nwildcard=0;
    int best_new_unmarked_entries_filled = 0;
    int smallest_acceptable_piece = 0;
    int current_nmarked=0,previous_nmarked=0;
    char *sptr, *eptr, *lptr,*bufptr;
    char *clibptr;
    char *libraryfile;
    char *targetsequence,*targetname;
    char buf[MAXBUF];
    char wildcards[MAXWILD][MAXLEN];
    FILE *libp;
    LIB epotemp;
    LIB library[MAX_LIB_ENTRIES];
    LIB epothilone;
    char *progname;
    char **filelist, **fileptr;

    libraryfile = "";
    targetsequence = "";
    targetname = "";
    for(ii=0; ii<MAXWILD; ii++) {
        for(jj=0; jj<MAXLEN; jj++) {
            wildcards[ii][jj] = '\0';
        }
    }

    /* process arguments */
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv)));
    progname = *argv++;
    if(argc<2) {
        fprintf(stderr,"usage:%s -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]\n",progname);
        exit(0);
    }
    while(argc-- > 1) {
        if(argv[0][0] == '-' && argv[0][1] != '\0') {
            /* handle option */
            *++(*argv); /* advance past the minus */
            switch(**argv) {
            case 'l': /* get library input filename (PKS.lib) */
                argv++; argc--;
                libraryfile = argv[0];
                fprintf(stderr,"-l: libraryfile=%s\n",libraryfile);
                break;
            case 'n': /* get target name string */
                argv++; argc--;
                targetname = argv[0];
                fprintf(stderr,"-t: targetname=%s\n",targetname);
                break;
            case 't': /* get target sequence string */
                argv++; argc--;
                targetsequence = argv[0];
                fprintf(stderr,"-t: targetsequence=%s\n",targetsequence);
                break;
            case 'x': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            case 'X': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'Y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'Z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            default:
                fprintf(stderr,"%s unknown option; ignored\n",*argv);
            }/*switch*/
        } else { /* a regular filename */
            *fileptr++ = *argv;
            *fileptr = NULL;
        }
        argv++;
    }/*while*/
if(nwildcard > 0) { 

for(ii=0; ii< nwildcard; ii++) { 
fprintf(stden:, w wfl^^ 
fprintf(stdouC^ 

} 

} 

    epothilone.nboundary = 0;
    for(ii=0; ii<MAXNAMELEN; ii++) {
        epothilone.name[ii] = '\0';
        epothilone.monomersequence[ii] = '\0';
        epothilone.alignedsequence[ii] = '\0';
        epothilone.boundarytoright[ii] = TRUE;
        for(jj=0; jj<MAXLEN; jj++) {
            epothilone.alignedPKSname[jj][ii] = '\0';
            epothilone.marked[jj] = FALSE;
            epothilone.context[jj][0] = '\0';
            epothilone.context[jj][1] = '\0';
            epothilone.context[jj][2] = '\0';
        }
    }
    strcpy(epothilone.name,targetname);
    strcpy(epothilone.monomersequence,targetsequence);
    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence);
    ecount = 0;
    eptr = epothilone.monomersequence;
    while(*eptr != '\0') {
        if(ecount == 0) {
            epothilone.context[ecount][0] = '-';
            epothilone.context[ecount][1] = *eptr;
            epothilone.context[ecount][2] = *(eptr + 1);
        } else {
            if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                epothilone.context[ecount][0] = *(eptr-1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = '-';
            } else {
                epothilone.context[ecount][0] = *(eptr-1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr+1);
            }
        }
        epothilone.context[ecount][3] = '\0';
        eptr++;
        ecount++;
    }
    for(ii=0; ii<ecount; ii++) {
        fprintf(stdout,"(%s)",epothilone.context[ii]);
    }

    /* library */
    nlib = get_library(libraryfile,library);
    fprintf(stdout,"nlib=%d\n",nlib);
    kk=0; while(kk < nlib) {
        /* zero out the epothilone entry with respect to a new alignment */
        for(ii=0; ii<MAXNAMELEN; ii++) {
            epothilone.alignedsequence[ii] = '\0';
            epothilone.boundarytoright[ii] = TRUE;
            for(jj=0; jj<MAXLEN; jj++) {
                epothilone.alignedPKSname[jj][ii] = '\0';
                epothilone.marked[jj] = FALSE;
            }
        }
        /* - reset the context back to that in epothilone - */
        ecount = 0;
        eptr = epothilone.monomersequence;
        while(*eptr != '\0') {
            if(ecount == 0) {
                epothilone.context[ecount][0] = '-';
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            } else {
                if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                    epothilone.context[ecount][0] = *(eptr-1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = '-';
                } else {
                    epothilone.context[ecount][0] = *(eptr-1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = *(eptr+1);
                }
            }
            epothilone.context[ecount][3] = '\0';
            eptr++;
            ecount++;
        }
        /* - align STARTER (current library entry) and epothilone - */
        sptr = library[kk].monomersequence;
        lcount=0;

while(*sptr N I \0*) { 

fprint^stdout/'lib^ 

ldc,lcx)imt,Ubrary[kk].monomersequence[lcount]); 

sptrH-; 
lcount+4-; 

} 

        /* Call maximal_adjacent_alignment until it no longer
        returns more than two adjacent modules. There is no reason to
        try to extract individual modules, because this is done as part of the
        recursive filling of spaces from the library.
        */
        smallest_acceptable_piece = 2;
        eptr = epothilone.monomersequence;
        fprintf(stdout, "ALIGN_TARGET: ");
        while(*eptr != '\0') {
            fprintf(stdout,"%c",*eptr);
            eptr++;
        }
        fprintf(stdout,"\n");
        fprintf(stderr,"aligning %d %s\n",kk,library[kk].name);
        best_new_unmarked_entries_filled = 0;
        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment_and_dump(&epothilone,nwildcard,wildcards,library,kk,
            smallest_acceptable_piece)) >= STARTER_MINIMUM_ADJACENT_ALIGN) {
            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled) {
                best_new_unmarked_entries_filled =
                    new_unmarked_entries_filled;
            }

if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN: 
newjmmarked_entriesjBDed^/od\n^ 

epothilonelen = strlen(epothilonejnonomersequence); 
for(ii=0; ii< epothilonelen; ii++){ 
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if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN:fbund 

a best alignment 

between epo jnonomer[%d]=%c in library[%d] .name=%s\n M , 
ii,epoMonejnonomersequence[ii]^epoMone.d^ 

* } 

} 

        library[kk].recursion_tagged = TRUE;
        fprintf(stdout,"ALIGN_TARGET: \n");
        dump_STARTER_align(epothilone,nwildcard,wildcards);
        fprintf(stdout,"ALIGN_TARGET: \n");
        if(best_new_unmarked_entries_filled <= 1) {
            fprintf(stdout,"ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n",best_new_unmarked_entries_filled);
            fprintf(stdout,"ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n",
                kk,library[kk].name);
            library[kk].recursion_tagged = FALSE;
            kk++;
            continue;
        }
        /* - fill in the gaps from the library - */
        /* generate a fresh copy of epothilone in epotemp */
        epothilonelen = strlen(epothilone.monomersequence);
        nfilledmax = strlen(epothilone.monomersequence);
        fprintf(stdout,"nfilledmax=%d\n",nfilledmax);
        reset_epotemp(&epotemp,epothilone);
        nfilled = 0;
        for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
        if(DEBUG_STARTER) fprintf(stdout,"nfilled from STARTER=%d\n",nfilled);
        for(mmpass = 0; mmpass < nlib; mmpass++) {
            if(mmpass == kk) { continue; }
            reset_epotemp(&epotemp,epothilone);
            nfilled = 0;
            for(ii=0; ii< epothilonelen; ii++) {
                if(epotemp.marked[ii] == TRUE) nfilled++; }
            if(nfilled >= nfilledmax) {
                output_fresh_alignment(&epotemp);
            } else {
                current_nmarked = nfilled;
                previous_nmarked = nfilled;
                smallest_acceptable_piece = 1;
                while((new_unmarked_entries_filled =
                    maximal_adjacent_alignment(&epotemp,nwildcard,wildcards,library,mmpass,
                    smallest_acceptable_piece)) >= MINIMUM_ADJACENT_ALIGN) {
                    current_nmarked += new_unmarked_entries_filled;
                    if(DEBUG_MATCH) fprintf(stdout,
                        "main: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                        recursion_counter,mmpass,previous_nmarked,current_nmarked);
                }
                nfilled = 0;
                for(ii=0; ii< epothilonelen; ii++) {
                    if(epotemp.marked[ii] == TRUE) nfilled++; }
                if(nfilled >= nfilledmax) {
                    output_fresh_alignment(&epotemp);
                    continue; /* no need to recurse */
                }
                if(DEBUG_MATCH) fprintf(stdout,"main: about to RECURSE: mmpass=%d\n",mmpass);
                library[mmpass].recursion_tagged = TRUE;
                recursion_counter++;
                recurse_through_the_library(nfilledmax,epotemp,&epothilone,nwildcard,wildcards,
                    nlib,library,&recursion_counter);
                library[mmpass].recursion_tagged = FALSE;
                recursion_counter--;
            }
        }
        library[kk].recursion_tagged = FALSE;
        kk++;
    }/*nlib*/
}/*main*/

int recurse_through_the_library(
int nfilledmax,
LIB epotemp,
LIB *epothilone,
int nwildcard,
char wildcards[MAXWILD][MAXLEN],
int nlib,
LIB *library,
int *recursion_counter)
{
    int ii=0;
    int ecount=0,elen=0;
    int mmpass=0;
    int nfilled=0;
    int lcount=0;
    int previous_nmarked=0, current_nmarked=0;
    int smallest_acceptable_piece = 0;
    char *eptr;
    char *clibptr;
    char boundary[MAXNAMELEN];
    int new_unmarked_entries_filled=0;
    LIB epotemp_temp;

    if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, nlib=%d\n",*recursion_counter,nlib);
    elen = strlen(epotemp.monomersequence);
    nfilled = 0;
    for(ii=0; ii< elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
    previous_nmarked = nfilled;
    current_nmarked = nfilled;
    smallest_acceptable_piece = 1;
    if(nfilled >= nfilledmax) { return 1; }
    for (mmpass = 0; mmpass < nlib; mmpass++) {
        if(*recursion_counter >= RECURSION_COUNTER_CUTOFF) {
            return 1;
        }
        if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, mmpass=%d\n",*recursion_counter,mmpass);
        if(library[mmpass].recursion_tagged == TRUE) {
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: library[%d].recursion_tagged=TRUE; skipping\n",mmpass);
            continue;
        }
        reset_epotemp(&epotemp_temp,epotemp);
        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        previous_nmarked = nfilled;
        current_nmarked = nfilled;
        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment(&epotemp_temp,nwildcard,wildcards,library,
            mmpass,smallest_acceptable_piece)) >= 1) {
            current_nmarked += new_unmarked_entries_filled;
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                *recursion_counter,mmpass,previous_nmarked,current_nmarked);
        }
        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        if(nfilled >= nfilledmax) {
            output_fresh_alignment(&epotemp_temp);
            continue;
        }
        library[mmpass].recursion_tagged = TRUE;
        (*recursion_counter)++;
        recurse_through_the_library(nfilledmax,epotemp_temp,epothilone,nwildcard,
            wildcards,nlib,library,recursion_counter);
        library[mmpass].recursion_tagged = FALSE;
        (*recursion_counter)--;
    }/*mmpass*/
    return 1;
} /*recurse_through_the_library*/
/* 

PURPOSE: 

INPUT: 

OUTPUT: 

returns the size of the largest maximal adjacent set of 
monomers inserted. 

PROCEDURE: 

*/

int maximal_adjacent_alignment(
LIB *epothilone,
int nwildcard,
char wildcards[MAXWILD][MAXLEN],
LIB *library,
int ilib,
int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }
    if(DEBUG_ALIGN) fprintf(stdout,"maximal_adjacent_alignment: smallest_acceptable_piece=%d\n",smallest_acceptable_piece);
    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;
    while (*eptr != '\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;
        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'X') { wptr = wildcards[0]; }
                else if(*eptr == 'Y') { wptr = wildcards[1]; }
                else if(*eptr == 'Z') { wptr = wildcards[2]; }
                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece,
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN)
                            fprintf(stdout,"FOUND a largest piece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                            nlargestpiece,
                            largestpiece_ecount, *largestpiece_eptr,
                            ilib,library[ilib].name,largestpiece_lcount, *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    /* NEW */
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }
        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
    }
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);
    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            lcount++;
            ecount++;
        }
    }
    return (nlargestpiece);
} /*maximal_adjacent_alignment*/
/* 

PURPOSE: 

INPUT: 

OUTPUT: 

returns the size of the largest maximal adjacent set of 
monomers inserted. 

PROCEDURE: 

*/ 

int maximal_adjacent_alignment_and_dump(
LIB *epothilone,
int nwildcard,
char wildcards[MAXWILD][MAXLEN],
LIB *library,
int ilib,
int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int elen=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }

iprintffatdout/'maxin^ 
smaflest_acceptablej3^ 

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    elen = strlen(epothilone->monomersequence);
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;
    while (*eptr != '\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        /* NEW */
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;
        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'X') { wptr = wildcards[0]; }
                else if(*eptr == 'Y') { wptr = wildcards[1]; }
                else if(*eptr == 'Z') { wptr = wildcards[2]; }
                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece,
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN)
                            fprintf(stdout,"FOUND a largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n",
                            nlargestpiece,
                            largestpiece_ecount, *largestpiece_eptr, largestpiece_lcount,
                            *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    /* NEW */
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }
        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;

if[DEBUG_ALIGN) { 

lprintf(stdout, "incrementing 
holdJhis_place_eptr=%c, 
holdJhis_ecount^d\nVholdJlri^^ 

} 

} 

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);
    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        fprintf(stdout,"ALIGN_TARGET: ");
        for(ii=0; ii<largestpiece_ecount; ii++) {
            fprintf(stdout," ");
        }
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            fprintf(stdout,"%c",library[ilib].monomersequence[lcount]);
            lcount++;
            ecount++;
        }
        for(ii=ecount; ii<elen; ii++) {
            fprintf(stdout," ");
        }
        fprintf(stdout," %s\n",library[ilib].name);
    }
    return (nlargestpiece);
}/*maximal_adjacent_alignment_and_dump*/
int output_fresh_alignment(
LIB *epotemp)
{
    int ecount=0;
    char *eptr;
    char boundary[MAXNAMELEN];

    eptr = epotemp->monomersequence;
    ecount = 0;
    epotemp->nboundary = 0;
    strcpy(boundary,epotemp->alignedPKSname[0]);
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            epotemp->nboundary++;
        }
        ecount++;
        eptr++;
    }
    if(epotemp->nboundary > NBOUNDARY_CUTOFF) return 1;
    eptr = epotemp->monomersequence;
    ecount = 0;
    fprintf(stdout,"HIT ");
    while(*eptr != '\0') {
        if(epotemp->alignedPKSname[ecount][0] == '\0') {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:TARG(%s)| ",*eptr,epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:TARG(%s) ",*eptr,epotemp->context[ecount]);
            }
        } else {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:%4s(%s)| ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:%4s(%s) ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            }
        }
        ecount++;
        eptr++;
    }
    fprintf(stdout,"npiece %d",epotemp->nboundary);
    fprintf(stdout,"\n");
    return 1;
} /*output_fresh_alignment*/
int get_library(
char *libraryfile,
LIB *library)
{
    int ii=0,jj=0,kk=0,lcount=0;
    int nlib=0;
    char *bufptr,buf[MAXBUF];
    char tmonomersequence[MAXNAMELEN];
    char *lptr,*tptr;
    FILE *libp;

    for(kk=0; kk < MAX_LIB_ENTRIES; kk++) {
        library[kk].recursion_tagged = FALSE;
        for(ii=0; ii<MAXNAMELEN; ii++) {
            library[kk].name[ii] = '\0';
            library[kk].monomersequence[ii] = '\0';
            library[kk].alignedsequence[ii] = '\0';
            for(jj=0; jj<MAXLEN; jj++) {
                library[kk].alignedPKSname[jj][ii] = '\0';
                library[kk].marked[jj] = FALSE;
                library[kk].context[jj][0] = '\0';
                library[kk].context[jj][1] = '\0';
                library[kk].context[jj][2] = '\0';
                library[kk].context[jj][3] = '\0';
            }
        }
        library[kk].nboundary = 0;
    }

/* read in the library from PKS.lib*/ 
if(hTCJLL^ { 

fprmtftstdout/'TRY AGAIN; couldnt open %sW',libraryfile); 

nlib=0; 

exitQ; 

} 

nlib=0; 

while(nlib < MAX_LIB^ENTRIES) { 
if(bnJLl>^gets^ 
bufptr = buf; 

if(*birfptr = W) continue; 
ifl(*bufptrNW){ 

lptr ■ library[nlib] Jiame; 

while((*bufptr !=' (^ufptr != Vty && 

(*bufptrf=W)){ 

*lptrf+ = *buft>trH-; 

} 
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*lptr»W; 

if((*bufbtr != WO && (*bufptr != V)) bufptr++; 
lptr = librarytnlib] .monomersequence; 
wbile((*bu$tr !=' 0 && (*bufptr != ^0*) && 

(*biuptr!=W)){ 

/* This code specifically deletes inter-modular 
double bonds, optional. 4 '/ 

if(*bufptr !=•=*){ 

*lptr++ = *bu4>tr++; 

} else { 

bufptr++; 

} 

} 

*lptr=W; 

if((*bufptr != \0') && (*bufptr != •\n 1 )) bufptr++; 
lptr = library[nlib].annotatedsequence; 
wtol^ufrtrN' , )&&(*bu§)tr!=^ , )&& 

(■n>ufptr!=V)){ 

*lptr+-+ = *bufptr++; 

} 

*lptr=W; . 

if((*bufptr != "NO") && (*bufbtr != V)) bufptr++; 

Q>rintf(stdout,"LIBRARY(%d) %s: 
%sW'^b4ibraiy[nHb].name,bbrary[nlib].monomersequence); 

^>rintf(stdout,"LIBRARY(%d) %s: 
%s\n ,, ^ib,ubrary[nlib].name,Ubrary[nUb].aimotatedseq^ 

nlib++; 

} 

} 

fclose(libp); 

for(kk=0; kk< nlib; kk++) {
    lcount = 0;
    lptr = library[kk].monomersequence;
    while(*lptr != '\0') {
        if(lcount == 0) {
            /* first monomer: no left neighbour, so pad */
            library[kk].context[lcount][0] = '_';
            library[kk].context[lcount][1] = *lptr;
            library[kk].context[lcount][2] = *(lptr + 1);
        } else {
            if(lcount == (strlen(library[kk].monomersequence) - 1)) {
                /* last monomer: no right neighbour, so pad */
                library[kk].context[lcount][0] = *(lptr - 1);
                library[kk].context[lcount][1] = *lptr;
                library[kk].context[lcount][2] = '_';
            } else {
                library[kk].context[lcount][0] = *(lptr - 1);
                library[kk].context[lcount][1] = *lptr;
                library[kk].context[lcount][2] = *(lptr + 1);
            }
        }
        library[kk].context[lcount][3] = '\0';
        lptr++;
        lcount++;
    }

    fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].monomersequence);
    fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].annotatedsequence);

    for(jj=0; jj<strlen(library[kk].monomersequence); jj++) {
        fprintf(stdout,"(%s)",library[kk].context[jj]);
    }
    fprintf(stdout,"\n");
}

return nlib;
}/*get_library*/
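The context bookkeeping at the end of get_library can be illustrated in isolation. The following is a minimal sketch, not the listing's own code: the function name build_context, the bound SKETCH_MAXLEN, and the choice of '_' as the pad character for a missing neighbour are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_MAXLEN 80   /* hypothetical bound, playing the role of MAXLEN */

/* Fill context[i] with the three-character neighbourhood of monomer i:
   left neighbour, the monomer itself, right neighbour.
   pad is written where no neighbour exists (an assumed convention).
   Returns the number of triples written. */
int build_context(const char *seq, char pad, char context[][4])
{
    size_t len = strlen(seq);
    size_t i;

    for (i = 0; i < len && i < SKETCH_MAXLEN; i++) {
        context[i][0] = (i == 0) ? pad : seq[i - 1];
        context[i][1] = seq[i];
        context[i][2] = (i + 1 == len) ? pad : seq[i + 1];
        context[i][3] = '\0';
    }
    return (int)i;
}
```

For the target MEMLJDGE this yields eight triples, beginning with "_ME" and ending with "GE_". These triples are the strings shown in parentheses in the HIT output lines.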
int dump_STARTER_align(
    LIB epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN]
)
{
    int elen=0;
    int ecount=0,hold_ecount=0;
    int wildcardmatch=FALSE;
    char *sptr,*eptr,*wptr;

    /* line 1: the target sequence */
    elen = strlen(epothilone.monomersequence);
    eptr = epothilone.monomersequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr != '\0') {
        fprintf(stdout,"%c",*eptr);
        eptr++;
    }
    fprintf(stdout,"\n");

    /* line 2: one marker per position */
    ecount=0;
    eptr = epothilone.monomersequence;
    sptr = epothilone.alignedsequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr != '\0') {
        wildcardmatch = FALSE;
        wptr = "";
        if(*eptr == 'X') { wptr = wildcards[0]; }
        else if(*eptr == 'Y') { wptr = wildcards[1]; }
        else if(*eptr == 'Z') { wptr = wildcards[2]; }
        while(*wptr != '\0') {
            if(*wptr == *sptr) {
                wildcardmatch = TRUE;
                break;
            }
            wptr++;
        }
        if(wildcardmatch == TRUE) {
            fprintf(stdout,":");
        } else {
            if(*eptr == *sptr) {
                fprintf(stdout,"|");
            } else {
                fprintf(stdout," ");
            }
        }
        eptr++;
        sptr++;
    }
    fprintf(stdout,"\n");

    /* line 3: the aligned monomers, followed by the PKS name */
    fprintf(stdout, "ALIGN_TARGET: ");
    eptr = epothilone.monomersequence;
    ecount=0;
    while(*eptr != '\0') {
        if(epothilone.alignedsequence[ecount] == '\0') {
            fprintf(stdout," ");
        } else {
            fprintf(stdout,"%c",epothilone.alignedsequence[ecount]);
            hold_ecount = ecount;
        }
        eptr++;
        ecount++;
    }
    fprintf(stdout," %s\n",epothilone.alignedPKSname[hold_ecount]);
    fprintf(stdout,"STARTER_ALIGN:\n");
}/*dump_STARTER_align*/
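The marker line printed between the target and the aligned monomers can be sketched as a pure function. This is an illustrative simplification under stated assumptions: the names marker_line and in_wildcard_set are hypothetical, the wildcard letters and their sets are passed in explicitly, and the use of '|' for an exact match is an assumption (only the ':' for a wildcard match is clearly legible in the listing).

```c
#include <string.h>

/* Return 1 if monomer c belongs to the wildcard set (a plain string). */
static int in_wildcard_set(const char *set, char c)
{
    return c != '\0' && strchr(set, c) != NULL;
}

/* Write one marker per target position into out, which must hold
   strlen(target) + 1 bytes:
     ':'  the target holds a wildcard letter and the aligned monomer
          is a member of that wildcard's set
     '|'  the aligned monomer equals the target monomer (assumed marker)
     ' '  anything else, including unfilled positions (aligned[i] == '\0')
   wildcard_letters[k] is paired with wildcard_sets[k]. */
void marker_line(const char *target, const char *aligned,
                 const char *wildcard_letters,
                 const char *wildcard_sets[], char *out)
{
    size_t n = strlen(target);
    size_t i;

    for (i = 0; i < n; i++) {
        const char *hit = strchr(wildcard_letters, target[i]);
        char mark = ' ';

        if (aligned[i] != '\0') {
            if (hit != NULL &&
                in_wildcard_set(wildcard_sets[hit - wildcard_letters], aligned[i])) {
                mark = ':';
            } else if (target[i] == aligned[i]) {
                mark = '|';
            }
        }
        out[i] = mark;
    }
    out[n] = '\0';
}
```

Here aligned is treated as a buffer of the same length as the target in which unfilled positions hold '\0', which mirrors how alignedsequence is zero-filled before each alignment.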

reset_epotemp(
    LIB *epotemp,
    LIB epothilone)
{
    int jj=0, elen=0;

    elen = strlen(epothilone.monomersequence);
    strcpy(epotemp->name,epothilone.name);
    strcpy(epotemp->monomersequence,epothilone.monomersequence);
    strcpy(epotemp->alignedsequence,epothilone.alignedsequence);
    epotemp->nboundary = 0;

    for(jj=0;jj<elen;jj++) {
        strcpy(epotemp->alignedPKSname[jj],epothilone.alignedPKSname[jj]);
        epotemp->marked[jj] = epothilone.marked[jj];
        strcpy(epotemp->context[jj],epothilone.context[jj]);
        epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];
    }

}/*reset_epotemp*/

EXAMPLE 6 

Source Code: 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ~siani/programs/morph/morph4.c

PURPOSE: To recursively traverse all the entries in PKS.lib, generating all feasible
combinations of PKS modules to make the TARGET (e.g., epothilone).

INPUT: -b nboundary_cutoff: lets user set the maximum number of
boundaries in output lines. This defaults to 5 (#define NBOUNDARY_CUTOFF 5), which is a
reasonable assumption for something of the length of epothilone (8 modules). However, when
looking at discodermolide, which has 11 modules, a cutoff of 5 sometimes results in too few
output lines; it is too restrictive.

-d allows one to ignore the inter-modular double bonds in the library file.

-l libraryfile: tab-delimited CHUCKLES-coded polyketides file with the following
columns

1. polyketide name

2. plain CHUCKLES

3. annotated CHUCKLES (contains information about post-synthetic
modifications)

4. source organism

-n targetname: user-defined name (e.g., epoD)

-t targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g.,
MEMLJDGE)

-w, -x, -y, -z sets of wildcards: sets of monomers for particular positions appearing in
targetsequence. The wildcards can effectively be used for analoging the TARGET polyketide.

Hard-coded parameters which may be reset (requires recompiling):
#define NBOUNDARY_CUTOFF 5

NBOUNDARY_CUTOFF determines the maximum number of non-native inter-modular
interfaces which are contained in the output. This is now set to 5, but may be increased
when the user does not care about inefficiencies introduced by these interfaces or when the
targetsequence is very lengthy.

#define RECURSION_COUNTER_CUTOFF 2

RECURSION_COUNTER_CUTOFF specifies the number of levels of recursion
(defaults to 0, 1, 2) acceptable for the run. This limit must be set since the large PKS library can
result in recursion that will combinatorially explode. Because of the multi-directionality of the
alignments (using every library entry as a STARTER), there is no need to go beyond 2 levels of
recursion. However, there may be cases in the future where this number should be increased.
Note that while recursion will eventually terminate without this parameter, runs with a library
over about 20 PKS entries may run for years on a reasonably fast computer.

OUTPUT: All combinations of modules that meet parameters set by user. 

Example output from MEMLJDGE (epothilone D) using subset of PKS.lib. 
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of 
"pieces" that are needed to put together the PKS. 

Names of PKSs have been abbreviated to fit them in these comments. 
HIT M:3atyl(FMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD) 
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5 

HIT M:albMl(LME) E:albMl(MEJ)| M:albMl(LME)| L:aldga(MLG)| J:aldga(GJD) 
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5 

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:3atyl(NGO)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

USAGE: 

morph3 -l libraryfile -n targetname -t targetsequence [-w W-wildcards] [-x X-wildcards]
[-y Y-wildcards] [-z Z-wildcards] -d

examples:

# generate combinations that yield epothilone D

%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD

%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort

%egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTER_ALIGN

# generate combinations that yield epothilone D with a C13-hydroxyl
%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort
%egrep ALIGN_TARGET oepoD-13OH > oepoD-13OH_STARTER_ALIGN

# generate combinations that yield epothilone with the following wildcards (set 1)
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1

%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

# generate combinations that yield epothilone with the following wildcards (set 2)
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2

%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort
LIMITATIONS: 

Current implementation cannot handle intra-modular modifications/splitting 
because morph is operating at the monomer level. Future implementations could convert the 
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform 
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double 
bonds are present in the library, but are ignored by the morph program.

MODIFICATIONS:

+ added ability to include user-defined wildcards (X, Y, or Z) on the
command line. MAS 05-16-00.
+ added additional wildcard (W). MAS 05-30-00.
+ added additional (summary) column to HIT output list. MAS 05-30-00.
+ added command line argument for suppressing the inter-modular double bonds
in the library. Default is not to treat these as separate modules. MAS 05-31-00.
+ added column that contains the length of the largest matching fragment. MAS
06-05-00.
*/ 
#define TRUE 1
#define FALSE 0
#define DEBUG_MATCH FALSE
#define DEBUG_STARTER FALSE
#define DEBUG_ALIGN FALSE
#define DEBUG_RECURSE FALSE
#define DEBUG_WILDCARD FALSE

#define MAXLEN 80
#define MAX_TYL_LEN 6
#define MAX_EPO_LEN 6
#define MAXNAMELEN 500
#define MAX_LIB_ENTRIES 500
#define MAXWILD 4
#define MAXBUF 1000

#define NBOUNDARY_CUTOFF 5
#define RECURSION_COUNTER_CUTOFF 2
#define STARTER_MINIMUM_ADJACENT_ALIGN 2
#define MINIMUM_ADJACENT_ALIGN 2



typedef struct _lib {
    char name[MAXNAMELEN];
    char monomersequence[MAXNAMELEN];
    char annotatedsequence[MAXNAMELEN];
    char alignedsequence[MAXNAMELEN];
    char alignedPKSname[MAXLEN][MAXNAMELEN];
    int  boundarytoright[MAXNAMELEN];
    int  marked[MAXLEN];
    char context[MAXLEN][4];
    int  recursion_tagged;
    int  nboundary;
} LIB;
main(int argc, char **argv)
{
    int ii=0, jj=0, kk=0, ll=0;
    int nlib=0;
    int ecount=0;
    int nfilled=0,nfilledmax=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int mmpass=0;
    int lcount=0;
    int new_unmarked_entries_filled=0;
    int recursion_counter = 0;
    int nwildcard=0;
    int best_new_unmarked_entries_filled = 0;
    int smallest_acceptable_piece = 0;
    int current_nmarked=0,previous_nmarked=0;
    int inter_modular_db_flag_off = FALSE;
    int nboundary_cutoff=NBOUNDARY_CUTOFF;
    char *sptr, *eptr, *lptr,*bufptr;
    char *clibptr;
    char *libraryfile;
    char *targetsequence,*targetname;
    char buf[MAXBUF];
    char wildcards[MAXWILD][MAXLEN];
    FILE *libp;
    LIB epotemp;
    LIB library[MAX_LIB_ENTRIES];
    LIB epothilone;

    char *progname;
    char **filelist, **fileptr;

    libraryfile="";
    targetsequence = "";
    targetname = "";

    for(ii=0; ii<MAXWILD; ii++) {
        for(jj=0; jj<MAXLEN; jj++) {
            wildcards[ii][jj] = '\0';
        }
    }

    /* process arguments */
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv)));

    progname = *argv++;
    if(argc<2) {
        fprintf(stderr,"usage:%s [-b nboundary_cutoff] [-d] -l libraryfile -n targetname -t targetsequence [-w W-wildcards] [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]\n",progname);
        exit(0);
    }

    while(argc-- > 1) {
        if(argv[0][0] == '-' && argv[0][1] != '\0') {
            /* handle option */
            *++(*argv); /* advance past the minus */
            switch(**argv) {
            case 'b': /* get number of boundaries cutoff for output of alignments */
                argv++; argc--;
                sscanf(argv[0],"%d",&nboundary_cutoff);
                fprintf(stderr,"-b: nboundary_cutoff=%d\n",nboundary_cutoff);
                break;

            case 'd': /* ignore inter-modular double bonds in the library file */
                inter_modular_db_flag_off = TRUE;






                fprintf(stderr,"-d: inter-modular double bonds ignored\n");
                break;

            case 'l': /* get library input filename (PKS.lib) */
                argv++; argc--;
                libraryfile = argv[0];
                fprintf(stderr,"-l: libraryfile=%s\n",libraryfile);
                break;

            case 'n': /* get target name string */
                argv++; argc--;
                targetname = argv[0];
                fprintf(stderr,"-n: targetname=%s\n",targetname);
                break;

            case 't': /* get target sequence string */
                argv++; argc--;
                targetsequence = argv[0];
                fprintf(stderr,"-t: targetsequence=%s\n",targetsequence);
                break;

            case 'w': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-w: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;

            case 'x': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;

            case 'y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-y: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;






            case 'z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3],argv[0]);
                fprintf(stderr,"-z: wildcards[%d]=%s\n",3,wildcards[3]);
                nwildcard++;
                break;

            case 'W': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-w: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;

            case 'X': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;

            case 'Y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-y: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;

            case 'Z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3],argv[0]);
                fprintf(stderr,"-z: wildcards[%d]=%s\n",3,wildcards[3]);
                nwildcard++;
                break;

            default:
                fprintf(stderr,"%s unknown option; ignored\n",*argv);
            }/*switch*/
        }
        else { /* a regular filename */
            *fileptr++ = *argv;
            *fileptr=NULL;
        }

        argv++;
    }/*while*/



    if(nwildcard>0) {
        for(ii=0; ii<nwildcard; ii++) {
            fprintf(stderr,"wildcards[%d]=%s\n",ii,wildcards[ii]);
            fprintf(stdout,"wildcards[%d]=%s\n",ii,wildcards[ii]);
        }
    }

    epothilone.nboundary = 0;

    for(ii=0; ii<MAXNAMELEN; ii++) {
        epothilone.name[ii] = '\0';
        epothilone.monomersequence[ii] = '\0';
        epothilone.alignedsequence[ii] = '\0';
        epothilone.boundarytoright[ii] = TRUE;
        for(jj=0; jj<MAXLEN; jj++) {
            epothilone.alignedPKSname[jj][ii] = '\0';
            epothilone.marked[jj] = FALSE;
            epothilone.context[jj][0] = '\0';
            epothilone.context[jj][1] = '\0';
            epothilone.context[jj][2] = '\0';
        }
    }

    strcpy(epothilone.name,targetname);
    strcpy(epothilone.monomersequence,targetsequence);

    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence);



    ecount = 0;
    eptr = epothilone.monomersequence;
    while(*eptr != '\0') {
        if(ecount == 0) {
            /* first monomer: no left neighbour, so pad */
            epothilone.context[ecount][0] = '_';
            epothilone.context[ecount][1] = *eptr;
            epothilone.context[ecount][2] = *(eptr + 1);
        } else {
            if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                /* last monomer: no right neighbour, so pad */
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = '_';
            } else {
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            }
        }
        epothilone.context[ecount][3] = '\0';
        eptr++;
        ecount++;
    }

    for(ii=0; ii<ecount; ii++) {
fpiintf[std^ 

} 



/* library */ 

    nlib = get_library(libraryfile,library,inter_modular_db_flag_off);
    fprintf(stdout,"nlib=%d\n",nlib);



    kk=0; while(kk < nlib) {
        /* zero out the epothilone entry with respect to a new alignment */
        for(ii=0; ii<MAXNAMELEN; ii++) {
            epothilone.alignedsequence[ii] = '\0';
            epothilone.boundarytoright[ii] = TRUE;
            for(jj=0; jj<MAXLEN; jj++) {
                epothilone.alignedPKSname[jj][ii] = '\0';
                epothilone.marked[jj] = FALSE;
            }
        }

        /* reset the context back to that in epothilone */
        ecount = 0;
        eptr = epothilone.monomersequence;
        while(*eptr != '\0') {
            if(ecount == 0) {
                epothilone.context[ecount][0] = '_';
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            } else {
                if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = '_';
                } else {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = *(eptr + 1);
                }
            }
            epothilone.context[ecount][3] = '\0';
            eptr++;
            ecount++;
        }

        /* align STARTER (current library entry) and epothilone */
        sptr = library[kk].monomersequence;
        lcount=0;

        while(*sptr != '\0') {
iprint^stdouV'Ub^^ 

kk,lcount,library[kk] .monomersequenceflcount]); 

            sptr++;
            lcount++;

} 

        /* Call maximal_adjacent_alignment until it no longer returns more than
           two adjacent modules. There is really no reason to try to extract
           individual modules because this will be done as part of the
           recursive filling of spaces from the library.
        */

        smallest_acceptable_piece = 2;
        eptr = epothilone.monomersequence;
        fprintf(stdout, "ALIGN_TARGET: ");
        while(*eptr != '\0') {
            fprintf(stdout,"%c",*eptr);
            eptr++;
        }
        fprintf(stdout,"\n");

        fprintf(stderr,"aligning %d %s\n",kk, library[kk].name);

        best_new_unmarked_entries_filled = 0;
        while((new_unmarked_entries_filled =
                   maximal_adjacent_alignment_and_dump(&epothilone,nwildcard,wildcards,
                       library,kk,smallest_acceptable_piece)) >= STARTER_MINIMUM_ADJACENT_ALIGN) {

            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled) {
                best_new_unmarked_entries_filled =
                    new_unmarked_entries_filled;
            }

if(DEBUGjSTARTER) fprintf(stdout, "STARTER ALIGN: 
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new_unmarked_entriesjl^ 

epothilonelen = strlen(epothilone.monomersequence); 
for(ii=0; ii< epothilonelen; ii++){ 

if(DEBUG_STARTER) §jrintf(stdout, "STARTER ALIGN:found 
a best alignment between epo.monomer[%d]=%c in libraiy[%d] Jiame=%s\n", 

ii,q>othilonejnonomersequence[ii]4^ 
} 

} 

        library[kk].recursion_tagged = TRUE;

        fprintf(stdout,"ALIGN_TARGET: \n");
        dump_STARTER_align(epothilone,nwildcard,wildcards);
        fprintf(stdout,"ALIGN_TARGET: \n");

        if(best_new_unmarked_entries_filled <= 1) {
            fprintf(stdout,"ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n",best_new_unmarked_entries_filled);
            fprintf(stdout,"ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n",
                    kk,library[kk].name);
            library[kk].recursion_tagged = FALSE;
            kk++;
            continue;
        }

        /* fill in the gaps from the library */

        /* generate a fresh copy of epothilone in epotemp */
        epothilonelen = strlen(epothilone.monomersequence);
        nfilledmax = strlen(epothilone.monomersequence);
        fprintf(stdout,"nfilledmax=%d\n", nfilledmax);

        reset_epotemp(&epotemp,epothilone);
        nfilled = 0;

        for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
        if(DEBUG_STARTER) fprintf(stdout,"nfilled from STARTER=%d\n",nfilled);

        for(mmpass = 0; mmpass < nlib; mmpass++) {

            if(mmpass == kk) { continue; }
            reset_epotemp(&epotemp,epothilone);

            nfilled = 0;
            for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }

            if(nfilled >= nfilledmax) {
                output_fresh_alignment(&epotemp,nboundary_cutoff);
            } else {
                current_nmarked = nfilled;
                previous_nmarked = nfilled;
                smallest_acceptable_piece = 2;
                while((new_unmarked_entries_filled =
                           maximal_adjacent_alignment(&epotemp,nwildcard,wildcards,
                               library,mmpass,smallest_acceptable_piece)) >= MINIMUM_ADJACENT_ALIGN) {

                    current_nmarked += new_unmarked_entries_filled;

                    if(DEBUG_MATCH) fprintf(stdout, "main: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                            recursion_counter,mmpass,previous_nmarked,
                            current_nmarked);
                }

                nfilled = 0;
                for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }

                if(nfilled >= nfilledmax) {
                    output_fresh_alignment(&epotemp,nboundary_cutoff);
                    continue; /* no need to recurse */
                }

                if(DEBUG_MATCH) fprintf(stdout, "main: about to RECURSE: mmpass=%d\n",mmpass);

                library[mmpass].recursion_tagged = TRUE;
                recursion_counter++;

                recurse_through_the_library(nfilledmax,epotemp,&epothilone,nwildcard,wildcards,
                    nlib,library,&recursion_counter,nboundary_cutoff);

                library[mmpass].recursion_tagged = FALSE;
                recursion_counter--;
            }
        }

        library[kk].recursion_tagged = FALSE;
        kk++;
    }/*nlib*/

}/*main*/



/*
PURPOSE:
INPUT:
OUTPUT:
PROCEDURE:
*/

int recurse_through_the_library(
    int nfilledmax,
    LIB epotemp,
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    int nlib,
    LIB *library,
    int *recursion_counter,
    int nboundary_cutoff)
{
    int ii=0;
    int ecount=0,elen=0;
    int mmpass=0;
    int nfilled=0;
    int lcount=0;
    int previous_nmarked=0, current_nmarked=0;
    int smallest_acceptable_piece = 0;
    char *eptr;
    char *clibptr;
    char boundary[MAXNAMELEN];
    int new_unmarked_entries_filled=0;
    LIB epotemp_temp;

    if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, nlib=%d\n",*recursion_counter,nlib);

    elen = strlen(epotemp.monomersequence);
    nfilled = 0;
    for(ii=0; ii< elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
    previous_nmarked = nfilled;
    current_nmarked = nfilled;
    smallest_acceptable_piece = 1;
    if(nfilled >= nfilledmax) { return 1; }

    for (mmpass = 0; mmpass < nlib; mmpass++) {
        if(*recursion_counter >= RECURSION_COUNTER_CUTOFF) {
            return 1;
        }

        if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, mmpass=%d\n",*recursion_counter,mmpass);

        /* skip entries already in use higher up the recursion */
        if(library[mmpass].recursion_tagged == TRUE) {
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: library[%d].recursion_tagged=TRUE; skipping\n",mmpass);
            continue;
        }

        reset_epotemp(&epotemp_temp,epotemp);

        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        previous_nmarked = nfilled;
        current_nmarked = nfilled;

        while((new_unmarked_entries_filled =
                   maximal_adjacent_alignment(&epotemp_temp,nwildcard,wildcards,library,
                       mmpass,smallest_acceptable_piece)) >= 1) {

            current_nmarked += new_unmarked_entries_filled;
            if(DEBUG_MATCH) fprintf(stdout, "RECURSE: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                    *recursion_counter,mmpass,previous_nmarked,
                    current_nmarked);
        }

        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }

        if(nfilled >= nfilledmax) {
            output_fresh_alignment(&epotemp_temp,nboundary_cutoff);
            continue;
        }

        library[mmpass].recursion_tagged = TRUE;
        (*recursion_counter)++;

        recurse_through_the_library(nfilledmax,epotemp_temp,epothilone,nwildcard,wildcards,
            nlib,library,recursion_counter,nboundary_cutoff);

        library[mmpass].recursion_tagged = FALSE;
        (*recursion_counter)--;

    }/* mmpass */

}/*recurse_through_the_library*/
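The effect of tagging library entries and capping the recursion depth can be seen in a small counting sketch. It is a simplification offered for illustration only: count_visits is a hypothetical name, no alignment is performed, and each untagged entry is simply visited once per level, so the result is an upper bound on how many entry combinations such a traversal would examine.

```c
/* Count how many (entry, depth) visits a tagged, depth-limited traversal
   performs. tagged[i] != 0 means entry i is already in use higher up the
   recursion and is skipped, in the way recursion_tagged is used.
   cutoff plays the role of RECURSION_COUNTER_CUTOFF. */
long count_visits(int nlib, int *tagged, int depth, int cutoff)
{
    long visited = 0;
    int i;

    if (depth >= cutoff) {
        return 0;
    }
    for (i = 0; i < nlib; i++) {
        if (tagged[i]) {
            continue;
        }
        visited++;
        tagged[i] = 1;                       /* reserve this entry */
        visited += count_visits(nlib, tagged, depth + 1, cutoff);
        tagged[i] = 0;                       /* release it on the way back */
    }
    return visited;
}
```

With 20 entries and a cutoff of 2 the count is 20 + 20 x 19 = 400. Each additional level multiplies the last term by the number of entries still untagged, which is the combinatorial growth the cutoff is meant to contain.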



/* 

PURPOSE: 

INPUT: 

OUTPUT: 

returns the size of the largest maximal adjacent set of monomers inserted. 
PROCEDURE: 

*/ 

int maximal_adjacent_alignment(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout,"maximal_adjacent_alignment: smallest_acceptable_piece=%d\n",smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;

    /* try every start position in the target ... */
    while (*eptr != '\0') {

        sptr = library[ilib].monomersequence;
        lcount = 0;
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;

        /* ... against every start position in the library entry */
        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;

                    if(DEBUG_ALIGN) fprintf(stdout, "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                            tnlargestpiece, ecount, *eptr,
                            ilib,library[ilib].name,lcount, *sptr);

                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN) fprintf(stdout, "FOUND a largest piece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                                nlargestpiece, largestpiece_ecount,
                                *largestpiece_eptr, ilib,library[ilib].name,largestpiece_lcount, *largestpiece_sptr);
                    }

                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    /* run broken: restart at the next library position */
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    /*NEW*/
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }

        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
    }

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
            largestpiece_ecount,largestpiece_lcount);

    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;

        /* copy the matched piece into the target and mark it as filled */
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;

            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;

            lcount++;
            ecount++;
        }
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment*/
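The search performed by maximal_adjacent_alignment amounts to finding the longest run of consecutive monomers shared by the target and one library entry, restricted to target positions that are not yet marked. The sketch below expresses that idea as a brute-force index loop rather than the pointer walking of the listing; the name longest_unmarked_run and its parameters are hypothetical, and wildcard handling is omitted for brevity.

```c
#include <string.h>

/* Find the longest run of consecutive monomers common to target and source,
   using only target positions i where marked[i] == 0.
   On a non-zero return, *target_start and *source_start hold the offsets of
   the first such longest run; they are left untouched when 0 is returned. */
int longest_unmarked_run(const char *target, const int *marked,
                         const char *source,
                         int *target_start, int *source_start)
{
    int tlen = (int)strlen(target);
    int slen = (int)strlen(source);
    int best = 0;
    int t, s;

    for (t = 0; t < tlen; t++) {
        for (s = 0; s < slen; s++) {
            int k = 0;

            /* extend the run while both strings agree and the target
               position is still free */
            while (t + k < tlen && s + k < slen &&
                   !marked[t + k] && target[t + k] == source[s + k]) {
                k++;
            }
            if (k > best) {
                best = k;
                *target_start = t;
                *source_start = s;
            }
        }
    }
    return best;
}
```

Calling this repeatedly, marking the returned run after each call, and stopping once the result drops below a minimum length reproduces the loop structure used by the callers in the listing.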

/*
PURPOSE:
INPUT:
OUTPUT:
returns the size of the largest maximal adjacent set of monomers inserted.
PROCEDURE:
*/

int maximal_adjacent_alignment_and_dump(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int elen=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }

    fprintf(stdout,"maximal_adjacent_alignment_and_dump: smallest_acceptable_piece=%d\n",smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    elen = strlen(epothilone->monomersequence);
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;

    while (*eptr != '\0') {

        sptr = library[ilib].monomersequence;
        lcount = 0;
        /* NEW */
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;

        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;

                    if(DEBUG_ALIGN) fprintf(stdout, "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                            tnlargestpiece, ecount, *eptr,
                            ilib,library[ilib].name,lcount, *sptr);

                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;



                        if(DEBUG_ALIGN) fprintf(stdout, "FOUND a largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n",
                                nlargestpiece, largestpiece_ecount,
                                *largestpiece_eptr, largestpiece_lcount, *largestpiece_sptr);
                    }

                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    /*NEW*/
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }

        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
        if(DEBUG_ALIGN) {
            fprintf(stdout,"incrementing hold_this_place_eptr=%c, hold_this_ecount=%d\n",*hold_this_place_eptr,hold_this_ecount);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
            largestpiece_ecount,largestpiece_lcount);

    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");

        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;

        /* pad so the piece prints under its position in the target */
        fprintf(stdout,"ALIGN_TARGET: ");
        for(ii=0; ii<largestpiece_ecount; ii++) {
            fprintf(stdout," ");
        }

        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;

            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;

            fprintf(stdout,"%c",library[ilib].monomersequence[lcount]);

            lcount++;
            ecount++;
        }

        for(ii=ecount; ii<elen; ii++) {
            fprintf(stdout," ");
        }
        fprintf(stdout," %s\n",library[ilib].name);
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment_and_dump*/

/* 

PURPOSE: 
INPUT: 
OUTPUT: 
PROCEDURE: 

*/ 

int output_fresh_alignment(
    LIB *epotemp,
    int nboundary_cutoff)
{
    int acount=0,ecount=0;
    int longest_segmentlen=0,current_segmentlen=0;
    char *aptr,*eptr;
    char boundary[MAXNAMELEN];

    eptr = epotemp->monomersequence;
    ecount = 0;
    epotemp->nboundary = 0;
    longest_segmentlen = 0;
    current_segmentlen = 0;

    /* count the non-native interfaces and track the longest segment */
    strcpy(boundary,epotemp->alignedPKSname[ecount]);
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            epotemp->nboundary++;
            if(current_segmentlen > longest_segmentlen) {
                longest_segmentlen = current_segmentlen;
            }
            current_segmentlen = 0;
        }
        current_segmentlen++;
        ecount++;
        eptr++;
    }

    if(current_segmentlen > longest_segmentlen) {
        longest_segmentlen = current_segmentlen;
    }

    /* suppress alignments with too many boundaries */
    if(epotemp->nboundary > nboundary_cutoff) return 1;

    eptr = epotemp->monomersequence;
    ecount = 0;

    /* print each monomer with its donor PKS name and context; '|' marks a boundary */
    fprintf(stdout,"HIT ");
    while(*eptr != '\0') {
        if(epotemp->alignedPKSname[ecount][0] == '\0') {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:TARG(%s)| ",*eptr,epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:TARG(%s) ",*eptr,epotemp->context[ecount]);
            }
        } else {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:%4s(%s)| ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:%4s(%s) ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            }
        }
        ecount++;
        eptr++;
    }
    fprintf(stdout,"%d %d | ",epotemp->nboundary,longest_segmentlen);
    fprintf(stdout," ");

    /* print the monomer sequence again in compact form, with '|' at each boundary */
    eptr = epotemp->monomersequence;
    acount = 0;
    ecount = 0;
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            fprintf(stdout,"%c| ",epotemp->context[acount][1]);
        }else{
            fprintf(stdout,"%c  ",epotemp->context[acount][1]);
        }
        eptr++;
        acount++;
        ecount++;
    }
    fprintf(stdout,"\n");
    return 1;
}/*output_fresh_alignment*/
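
The boundary-counting pass at the top of output_fresh_alignment can be isolated as a small helper. The following is only a sketch under stated assumptions: SegmentStats and count_segments are hypothetical names that do not appear in the listing, and a plain int array stands in for the boundarytoright field of LIB. It mirrors the listing's convention that a flagged position closes the preceding segment and is itself counted toward the next one.

```c
#include <stddef.h>

/* Hypothetical result type (not part of the listing). */
typedef struct {
    int nboundary;        /* number of positions flagged as boundaries */
    int longest_segment;  /* longest run of monomers between boundaries */
} SegmentStats;

/* Count flagged boundaries and the longest run between them.
   boundary_to_right[i] is nonzero where the listing would test
   boundarytoright[i] == TRUE. */
SegmentStats count_segments(const int *boundary_to_right, size_t len)
{
    SegmentStats stats = {0, 0};
    int current = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (boundary_to_right[i]) {
            stats.nboundary++;
            if (current > stats.longest_segment) {
                stats.longest_segment = current;
            }
            current = 0;   /* the flagged monomer starts the next segment */
        }
        current++;
    }
    /* the final segment has no closing boundary, so check it here */
    if (current > stats.longest_segment) {
        stats.longest_segment = current;
    }
    return stats;
}
```

A caller could compare stats.nboundary with a cutoff, as the listing does with nboundary_cutoff, before deciding whether to print the alignment.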

/*

*/

int get_library(
    char *libraryfile,
    LIB *library,
    int inter_modular_db_flag_off)
{
    int ii=0,jj=0,kk=0,lcount=0;
    int nlib=0;
    char *bufptr,buf[MAXBUF];
    char tmonomersequence[MAXNAMELEN];
    char *lptr,*tptr;
    FILE *libp;

    /* clear every library entry before reading the file */
    for(kk=0; kk < MAX_LIB_ENTRIES; kk++) {
        library[kk].recursion_tagged = FALSE;
        for(ii=0; ii<MAXNAMELEN; ii++) {
            library[kk].name[ii] = '\0';
            library[kk].monomersequence[ii] = '\0';
            library[kk].alignedsequence[ii] = '\0';
            for(jj=0;jj<MAXLEN;jj++) {
                library[kk].alignedPKSname[jj][ii] = '\0';
                library[kk].marked[jj] = FALSE;
                library[kk].context[jj][0] = '\0';
                library[kk].context[jj][1] = '\0';
                library[kk].context[jj][2] = '\0';
                library[kk].context[jj][3] = '\0';
            }
        }
        library[kk].nboundary = 0;
    }

    /* read in the library from PKS.lib*/
    if(NULL==(libp=fopen(libraryfile,"r"))) {
        fprintf(stdout,"TRY AGAIN; couldnt open library file: %s\n",libraryfile);
        nlib=0;
        exit(1);
    }
    nlib=0;
    while(nlib < MAX_LIB_ENTRIES) {
        /* read the next library line; stop at end of file */
        if(fgets(buf,MAXBUF,libp) == NULL) break;
        bufptr=buf;
        /* skip comment lines; the '#' marker is an assumption, the character is illegible in the source */
        if(*bufptr == '#') continue;
        if(*bufptr != '\0'){
            /*
            sscanf(bufptr,"%s %s %s",library[nlib].name,tmonomersequence,library[nlib].annotatedsequence);
            */

            /* first field: entry name */
            lptr = library[nlib].name;
            while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')) {
                *lptr++ = *bufptr++;
            }
            *lptr='\0';
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

            /* second field: monomer sequence */
            lptr = library[nlib].monomersequence;
            while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')) {
                /*
                This code specifically deletes inter-modular double bonds when the -d
                option is set.
                */
                if(inter_modular_db_flag_off == TRUE){
                    if(*bufptr != '='){
                        *lptr++ = *bufptr++;
                    } else {
                        bufptr++;
                    }
                } else {
                    *lptr++ = *bufptr++;
                }
            }
            *lptr='\0';
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

            /* third field: annotated sequence */
            lptr = library[nlib].annotatedsequence;
            while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
                *lptr++ = *bufptr++;
            }
            *lptr = '\0';
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

            fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib].monomersequence);
            fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib].annotatedsequence);
            nlib++;
        }
    }

    fclose(libp);

    /* build a three-character context (left neighbor, monomer, right neighbor) for each position */
    for(kk=0; kk< nlib; kk++) {
        lcount = 0;
        lptr = library[kk].monomersequence;
        while(*lptr!='\0'){
            if(lcount == 0) {
                /* no left neighbor; the '-' edge marker is an assumption, the character is illegible in the source */
                library[kk].context[lcount][0] = '-';
                library[kk].context[lcount][1] = *lptr;
                library[kk].context[lcount][2] = *(lptr + 1);
            } else {
                if(lcount == (strlen(library[kk].monomersequence) - 1)){
                    /* no right neighbor; same assumed edge marker */
                    library[kk].context[lcount][0] = *(lptr - 1);
                    library[kk].context[lcount][1] = *lptr;
                    library[kk].context[lcount][2] = '-';
                } else {
                    library[kk].context[lcount][0] = *(lptr - 1);
                    library[kk].context[lcount][1] = *lptr;
                    library[kk].context[lcount][2] = *(lptr + 1);
                }
            }
            library[kk].context[lcount][3] = '\0';
            lptr++;
            lcount++;
        }
        fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].monomersequence);
        fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].annotatedsequence);
        for(jj=0;jj<strlen(library[kk].monomersequence);jj++) {
            fprintf(stdout,"(%s)",library[kk].context[jj]);
        }
        fprintf(stdout,"\n");
    }

    return nlib;
}/*get_library*/
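
The two steps that get_library performs on each entry, field copying with optional removal of '=' characters and construction of the per-position context strings, can be sketched in bounded form as follows. This is an illustration under assumptions, not the listing's code: copy_field, build_contexts, CONTEXT_LEN, and EDGE_MARK are hypothetical names, the '-' edge marker is a placeholder for a character that is not legible in the listing, and the sample line in the usage note is invented.

```c
#include <stddef.h>
#include <string.h>

#define CONTEXT_LEN 4   /* three characters plus the terminator */
#define EDGE_MARK '-'   /* assumed placeholder for a missing neighbor */

/* Copy one space-delimited field from src into dst (at most dstsize-1
   characters), optionally dropping '=' characters. Returns a pointer to
   the first unread character of src. */
const char *copy_field(const char *src, char *dst, size_t dstsize,
                       int strip_double_bonds)
{
    size_t n = 0;

    while (*src != ' ' && *src != '\n' && *src != '\0') {
        if (!(strip_double_bonds && *src == '=') && n + 1 < dstsize) {
            dst[n++] = *src;
        }
        src++;
    }
    if (dstsize > 0) {
        dst[n] = '\0';
    }
    if (*src == ' ') {
        src++;   /* step over the field separator */
    }
    return src;
}

/* Fill context[i] with the monomer at position i flanked by its left and
   right neighbors; the caller provides at least strlen(seq) rows. */
void build_contexts(const char *seq, char context[][CONTEXT_LEN])
{
    size_t len = strlen(seq);
    size_t i;

    for (i = 0; i < len; i++) {
        context[i][0] = (i == 0) ? EDGE_MARK : seq[i - 1];
        context[i][1] = seq[i];
        context[i][2] = (i + 1 == len) ? EDGE_MARK : seq[i + 1];
        context[i][3] = '\0';
    }
}
```

For a made-up line such as "ery AD=GJ note", calling copy_field twice (the second time with strip_double_bonds set) yields the name "ery" and the sequence "ADGJ", and build_contexts then produces "-AD", "ADG", "DGJ", and "GJ-". Unlike the listing, the copy is bounded by the destination size, which is a deliberate change in this sketch.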



/* 

*/ 

int dump_STARTER_align(
    LIB epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN]
    )
{
    int elen=0;
    int ecount=0,hold_ecount=0;
    int wildcardmatch=FALSE;
    char *sptr,*eptr,*wptr;

    elen = strlen(epothilone.monomersequence);

    /* line 1: the target monomer sequence */
    eptr = epothilone.monomersequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr!='\0'){
        fprintf(stdout,"%c",*eptr);
        eptr++;
    }
    fprintf(stdout,"\n");
    ecount=0;

    /* line 2: match markers (':' wildcard match, '|' exact match, ' ' mismatch) */
    eptr = epothilone.monomersequence;
    sptr = epothilone.alignedsequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr!='\0'){
        wildcardmatch = FALSE;
        wptr="";
        if(*eptr == 'X') { wptr = wildcards[0]; }
        else if(*eptr == 'Y') { wptr = wildcards[1]; }
        else if(*eptr == 'Z') { wptr = wildcards[2]; }

        while(*wptr!='\0'){
            if(*wptr==*sptr){
                wildcardmatch = TRUE;
                break;
            }
            wptr++;
        }
        if(wildcardmatch == TRUE) {
            fprintf(stdout,":");
        } else {
            if(*eptr == *sptr){
                fprintf(stdout,"|");
            } else {
                fprintf(stdout, " ");
            }
        }
        eptr++;
        sptr++;
    }
    fprintf(stdout,"\n");

    /* line 3: the aligned sequence, followed by the name of the last aligned PKS */
    fprintf(stdout, "ALIGN_TARGET: ");
    eptr = epothilone.monomersequence;
    ecount=0;
    while(*eptr!='\0'){
        if(epothilone.alignedsequence[ecount] == '\0') {
            fprintf(stdout," ");
        }else{
            fprintf(stdout,"%c",epothilone.alignedsequence[ecount]);
            hold_ecount = ecount;
        }
        eptr++;
        ecount++;
    }
    fprintf(stdout," %s\n",epothilone.alignedPKSname[hold_ecount]);
    fprintf(stdout,"STARTER_ALIGN:\n");
    return 1;
}/*dump_STARTER_align*/
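
The middle line printed by dump_STARTER_align marks each target position as a wildcard match, an exact match, or a mismatch. A minimal sketch of that rule, writing into a buffer instead of stdout, might look like the following. The name match_line and its parameters are hypothetical, and the wildcard letters X, Y, and Z are an assumption because those characters are only partly legible in the listing.

```c
#include <stddef.h>
#include <string.h>

/* Write one marker per target position into out:
   ':' the aligned monomer belongs to the wildcard set at that position,
   '|' the aligned monomer equals the target monomer,
   ' ' otherwise.
   aligned must provide at least strlen(target) readable characters and
   out must hold strlen(target) + 1 characters. */
void match_line(const char *target, const char *aligned,
                const char *x_set, const char *y_set, const char *z_set,
                char *out)
{
    size_t i;

    for (i = 0; target[i] != '\0'; i++) {
        const char *set = "";   /* non-wildcard positions have an empty set */

        if (target[i] == 'X') set = x_set;
        else if (target[i] == 'Y') set = y_set;
        else if (target[i] == 'Z') set = z_set;

        /* guard against '\0', which strchr would find at the end of set */
        if (aligned[i] != '\0' && strchr(set, aligned[i]) != NULL) {
            out[i] = ':';
        } else if (target[i] == aligned[i]) {
            out[i] = '|';
        } else {
            out[i] = ' ';
        }
    }
    out[i] = '\0';
}
```

With target "AXDZ", aligned sequence "ABDG", and wildcard sets "BC" for X and "EF" for Z, the marker line has an exact match at positions 0 and 2, a wildcard match at position 1, and a mismatch at position 3.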

/* 

*/ 

int reset_epotemp(
    LIB *epotemp,
    LIB epothilone)
{
    int jj=0, elen=0;

    elen = strlen(epothilone.monomersequence);

    /* copy the target back into the working entry and clear its boundary count */
    strcpy(epotemp->name,epothilone.name);
    strcpy(epotemp->monomersequence,epothilone.monomersequence);
    strcpy(epotemp->alignedsequence,epothilone.alignedsequence);
    epotemp->nboundary = 0;
    for(jj=0; jj< elen; jj++) {
        strcpy(epotemp->alignedPKSname[jj],epothilone.alignedPKSname[jj]);
        epotemp->marked[jj] = epothilone.marked[jj];
        strcpy(epotemp->context[jj],epothilone.context[jj]);
        epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];
    }
    return 1;
}/*reset_epotemp*/

Thus, the present invention provides a useful means to generate new PKS genes and 
corresponding enzymes to produce polyketides. The invention having now been described by 
way of written description and examples, those of skill in the art will recognize that the 
invention can be practiced in a variety of embodiments and that the foregoing description and 
examples are for purposes of illustration and not limitation of the following claims. 

WHAT IS CLAIMED IS:

1. A method for representing the structure of a polyketide produced by a modular
polyketide synthase, said method comprising the steps of:

(a) defining a set of monomer units of which said polyketide is 
composed, 

(b) assigning an alphanumeric symbol or symbols to each different 
monomer unit in said set, 

(c) identifying one or more monomers in said set that are present in said
polyketide, and

(d) composing a string of said symbols ordered in a manner reflecting
the order in which said monomers occur in said polyketide, wherein said string
of symbols represents the structure of said polyketide.

2. The method of claim 1, wherein said monomer set comprises two-carbon unit monomers, 
wherein a first carbon of said unit is substituted with hydrogen or methyl, and a second carbon 
of said unit is substituted with oxygen, hydroxy, or hydrogen, and said two carbon unit 
comprises either a single or a double bond between said first and second carbons. 

3. The method of claim 2, wherein said monomer set additionally comprises one or more 
members selected from the group consisting of two carbon unit monomers in which said first 
carbon is substituted with hydroxy, methoxy, or ethyl; a moiety corresponding to an amino acid 
or amino acid derivative incorporated into a PKS by a non-ribosomal peptide synthase; a moiety 
corresponding to a structure incorporated into a polyketide by an AMP ligase or a CoA ligase; 
and a moiety corresponding to a structure in a polyketide after
modification by a polyketide modification enzyme.
4. The method of claim 2 wherein the set of monomer units and corresponding symbols
comprises:

[Structural drawings of the monomer units; the legible symbol assignments are A, B, C, D, I, J, K, L, M, and N.]



5. The method of claim 4 wherein the set of monomer units further comprises a
miscellaneous monomer that is assigned the symbol Q.

6. The method of claim 4 wherein the set of monomer units and corresponding symbols
further comprises

[Structural drawings of monomer units bearing an R substituent; the symbol assignments are not reliably legible.]
7. A database of polyketides, in which each said member is represented by a string of
alpha-numeric symbols, wherein said symbols represent structural subunits of said polyketide,
and said string represents the order in which such subunits occur in said polyketide.

8. The database of claim 7 that includes at least 100 different polyketides.

9. The database of claim 7 wherein each said member is represented by a CHUCKLES 
string. 

10. The database of claim 7 wherein each said member is represented by an annotated 
CHUCKLES string. 

11. The database of claim 7 wherein the symbol and its corresponding structural subunit are
selected from the group consisting of

[Structural drawings of the monomer units with their symbol assignments; only the labels E and G are legible.]

and Q for a miscellaneous monomer.
12. The database of claim 7 wherein the symbol and its corresponding structural subunit are
selected from the group consisting of

[Structural drawings of the monomer units, including variants bearing an R substituent, with their symbol assignments; the legible labels include M, N, A, and B.]

and Q for a miscellaneous monomer.

13. A database of polyketides, in which each said member is represented by a linearized 
representation of said polyketide. 

14. A method of designing a PKS gene capable of producing a desired polyketide, which
method comprises:

(a) defining a string of alphanumeric symbols representing the structure of said 
polyketide, 

(b) comparing said string to a database of strings of alphanumeric symbols 
representing polyketides produced by PKS genes, 
(c) identifying common elements in said string representing the structure of said
polyketide with elements in said strings in said database, and

(d) generating one or more new strings from elements identified in step (b) that 
match said string representing the structure of said polyketide, wherein said new string defines a 
PKS gene capable of producing said polyketide. 

15. The method of claim 14, wherein all possible PKS genes encoding a desired polyketide 
from said database are generated and displayed. 

16. The method of claim 14, wherein said new strings generated in step (d) are rated and 
displayed in an order based on one or more parameters. 

17. The method of claim 16, wherein said parameters are selected from the group consisting 
of number of non-native module interfaces and number of non-native protein interfaces. 
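
As an illustration of the rating described in claims 16 and 17, the sketch below scores candidate designs by counting interfaces between adjacent modules taken from different donor PKSs and then sorts the candidates by that count. It is a simplified example under stated assumptions, not the method of the listing: the Candidate type, the function names, MAX_MODULES, and the donor names in the usage note are all hypothetical, and two adjacent modules are treated as a native interface whenever they share a donor name, without checking that they were adjacent in the donor.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 32

/* Hypothetical candidate design: one donor PKS name per module. */
typedef struct {
    const char *donor[MAX_MODULES];
    size_t nmodules;
    int nonnative;   /* filled in by score_candidate */
} Candidate;

/* Count interfaces between adjacent modules that come from different donors. */
void score_candidate(Candidate *c)
{
    size_t i;

    c->nonnative = 0;
    for (i = 1; i < c->nmodules; i++) {
        if (strcmp(c->donor[i - 1], c->donor[i]) != 0) {
            c->nonnative++;
        }
    }
}

/* qsort comparator: fewer non-native interfaces sorts first. */
static int by_nonnative(const void *a, const void *b)
{
    const Candidate *ca = a;
    const Candidate *cb = b;

    return (ca->nonnative > cb->nonnative) - (ca->nonnative < cb->nonnative);
}

/* Score every candidate, then order them from fewest to most
   non-native module interfaces. */
void rank_candidates(Candidate *cands, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        score_candidate(&cands[i]);
    }
    qsort(cands, n, sizeof cands[0], by_nonnative);
}
```

For two four-module candidates with donors (ery, rap, ery, tyl) and (ery, ery, rap, rap), the second has one non-native interface and is ranked ahead of the first, which has three. Counting non-native protein interfaces, the other parameter named in claim 17, would need protein boundaries for each donor, which this sketch does not model.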
[Figure 2: structural drawings of the monomer units, each labeled with its symbol and the
PKS module domains that produce it, ranging from minimal modules such as (KS, AT1, ACP),
(KS, AT2, ACP), and (KS, AT3, ACP), through ketoreductase-containing modules such as
(KS, AT1, KR1, ACP), to fully reducing modules such as (KS, AT1, DH, ER, KR, ACP).
Legend: AT1 and AT2 incorporate methylmalonyl-CoA, and KR1 and KR2 produce a hydroxyl;
the numbered variants are distinguished by clockwise versus counter clockwise orientation.]

Figure 2
CHUCKLES: ADGJDD
SMILES:
C1(=O)[C@H](C)[C@@H](OH)-[C@@H](C)[C@@H](OH)-
[C@@H](C)C-[C@@H](C)C(=O)-[C@H](C)[C@@H](C)-
[C@@H](C)[C@@H](CC)O1

Figure 3
Figure 4A
Figure 4C
Figure 4E
Figure 5